Message



Because of the strict budgetary measures instituted a year ago,
IEEE Micro should show a modest profit in 1986. Even so, this was a
pretty tough year for most trade magazines. During this ongoing malaise
in the computer and semiconductor industry we lost about 1500 readers.
We were not hurt by industry cutbacks in advertising because we simply
did not have much to lose.

To ensure future health of the magazine, increases in our page count,
and, as many readers are requesting, a move toward a monthly magazine,
we need to increase our readership and attract more advertising. I do
not foresee IEEE Micro becoming awash in ads, as some magazines are, but
ads do have value to the reader as well as being revenue sources for the
magazine. To help expose IEEE Micro to a wider potential audience and
thereby increase readership, we will be offering “special” introductory
subscriptions, forming alliances with conferences, and actively pursuing
subscribers wherever we can. In the process of becoming more visible, we
should become more attractive to advertisers. I need your help. If you
have a suggestion to offer, or know of a group or event that may be a
candidate for our new subscriber campaign, please let me know. If you or
your employer are considering an advertisement, please give me a call,
and I will put you in touch with a representative.

The “special theme” issues of 1986 
have been a resounding success, judging 
from the mail and telephone messages I 
have been receiving. This issue on digital 
signal processing edited by Bob Morris 
closes out our 1986 editorial calendar. I 
From the

Editor-in-chief 


1986 Report 


would be remiss if I did not publicly
thank all of our 1986 guest editors for a 
splendid job: 

June 1986, Arthur R. Miller and 
Rhonda Alexis Dirvin (networking), 

August 1986, Victor K. L. Huang and 
Priscilla M. Lu (operating systems), 

October 1986, Barry W. Johnson 
(multiprocessing), and 

December 1986, L. Robert Morris. 

We have a very healthy backlog of excellent articles in addition to the
1987 special theme issues. We shall reduce this queue by publishing four
or five articles in the February 1987 issue. TRON comes in April.

I must admit that I am becoming very 
excited about the TRON issue. During 
the past two weeks major news articles 
concerning TRON have appeared not 
only in Electrical Engineering Times and 
Electronic News but also in the Wall 
Street Journal. Ken Sakamura, of your 
IEEE Micro editorial board, was named 
as the prime device and system architect 
in all three publications. TRON stands 
for The Realtime Operating System 
Nucleus and will be thoroughly covered 
by Sakamura in April. 

The mailbag was a little light this 
time, containing only 20 comment cards. 
As usual, I summarize the comments: 

“I would like to subscribe,” P.A., Netanya, Israel (A good start. J.F.)

“I like tutorial-type articles, even if they are blatantly motivated by
commercial interest.” A.B., Manchester, NH

“...would like to see more digital signal analysis.” E.A., McLean, VA

“...Benchmark article was good but (omitted) operating systems-type
operations.” K.S., Lexington, MA

“I liked VRTX...more academic authors.” T.F., Keene, NH

“Another good issue...I don’t feel lack of ads is very bad, although I’m
sure it would be nice for revenue, though!” B.H., Albuquerque, NM

“All articles should start on right-hand page to simplify rip-and-save.”
Anon. (Ouch! Would you consider saving the whole magazine? J.F.)

“I liked everything. Just go on! More comparisons of chips and
architectures.” P.L., Rio de Janeiro, Brazil

“VRTX was fascinating.” S.S., Falls Church, VA

“I liked the looks of all the articles...IEEE Micro is on the top of the
pile.” C.T., Cropwell, AL

“Overall, (this was) an excellent issue (networking).” J.B., Middletown
Springs, VT


Final notes: Henriecus Koeman has completed his terms on your editorial
board and leaves this month. In addition to his excellent editorial
reviewing work, Henriecus provided valuable ideas in our refereeing
process, which I have since implemented. Thank you, Henriecus.

To those readers observing Christmas or Chanukah, we wish a joyous
holiday season. To all of our readers, we wish a New Year filled with
health, peace, prosperity, and the benefits of technology.


Jim Farrell 
Letters to the Editor


Benchmark article stirs reader response 


To the Editor:

With regard to benchmarks: yes, yes, YES. There are few benchmarks with
any meaning that get published; that is, most of them are in somebody’s
advertising literature and are, shall I say, suspect. I found the article
in the August issue of IEEE Micro very interesting.

One slight problem with that article was the final summary of results;
the authors took an arithmetic mean when they should have taken a
geometric mean. Their main conclusions are the same, but the advantage
of the 68020 over the other processors is not quite as great as it
appears from their numbers. I have enclosed the correct version of their
Table 3, along with an article from the CACM showing why this is the
correct method. (Editor’s note: The CACM article is not reprinted here
because of its length; see Comm. ACM, Vol. 29, No. 3, Mar. 1986,
pp. 218-221.)

Bruce Walker
San Pedro, CA


Table 3. Revised using geometric means.

              Dynamic memory     Static memory
Processor     Mean     Ratio     Mean     Ratio
80286          9.91    3.55      5.40     2.78
80386          4.91    1.76      2.44     1.26
68000         12.87    4.61      8.39     4.32
68020N         5.41    1.94      2.67     1.38
68020C         2.79    1.00      1.94     1.00
32032         10.26    3.68      8.37     4.31
32100N         9.00    3.23      4.73     2.44
32100C         4.18    1.50      3.22     1.66


Thayne Cooper replies

We wish to thank all of you who read our 32-bit microprocessor benchmark
article. We hope the information was useful as well as interesting to
many of you.

The letter by Bruce Walker addresses an issue which is always of
interest in an article like ours: How should the information be
presented? We read the article referenced by Walker with interest and
suggest that others read it also.

We do note, however, that the article referenced by Walker treats the
use of the geometric means on normalized numbers of individual tests.
The summary we provided did not use normalized test numbers. Rather, it
took the arithmetic means of the raw unnormalized numbers and then
normalized those means.

We answered the question of information presentation by supplying the
raw data of the tests along with a summary. This provides the
opportunity for other kinds of summaries to be made. Walker has done
that and arrived at essentially the same results.


To the Editor:

While we at National Semiconductor Corporation were pleased that one of
the members of our Series 32000 microprocessor family was included in
the article, “A Benchmark Comparison of 32-bit Microprocessors” (IEEE
Micro, Aug. 1986, p. 53), we were disappointed to find that the
microprocessor tested was the 10-MHz NS32032 and not the newest member
of the family, the 15-MHz NS32332.

This MPU, which is both fully upward and downward software compatible
with the other members of the Series 32000 family and has such
speed-increasing features as an improved microinstruction set, larger
instruction prefetch queue, and burst access mode, has been available as
a 10-MHz part since October of 1985, with the 15-MHz version available
since March of 1986.

Our own in-house executions of the EDN benchmarks, run on National
Semiconductor’s NS32032-10-based DB32000 demonstration board, have
produced slightly faster results than were shown in the article, due to
our use of zero-wait-state RAM. Execution of the same benchmarks on our
DB332 demonstration board, utilizing the 15-MHz NS32332 and
zero-wait-state RAM, has shown an overall speed increase, as determined
from the mean of the test times, of approximately 2.2 times.

National Semiconductor is working with the authors of the article to
ensure that the NS32332-15 is included in their next round of benchmark
tests and looks forward to results which will clearly show the
architectural improvements of the NS32332.

David Raulino
Santa Clara, CA
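The difference between the two summaries discussed in these letters can be illustrated with a short sketch. The two-test timings below are hypothetical and are not taken from the August article; only the "Mean" column of the revised Table 3 (dynamic memory) is reused, to show how the "Ratio" column appears to follow from normalizing each mean to the fastest processor.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # exp(mean(log v)) equals the n-th root of the product of the values
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical execution times (seconds) for two tests on two machines.
times_a = [1.0, 10.0]
times_b = [10.0, 1.0]

# Per-test ratios, normalized first to machine A and then to machine B.
b_over_a = [b / a for a, b in zip(times_a, times_b)]
a_over_b = [a / b for a, b in zip(times_a, times_b)]

# Arithmetic mean of normalized ratios: each machine looks ~5x slower
# than the other, depending only on which one is the reference.
arith_b = arithmetic_mean(b_over_a)
arith_a = arithmetic_mean(a_over_b)

# Geometric mean of normalized ratios: the answer is the same (1.0)
# whichever machine is used as the reference.
geo_b = geometric_mean(b_over_a)
geo_a = geometric_mean(a_over_b)

# Dynamic-memory means from the revised Table 3, normalized to the
# smallest mean (68020C) to obtain the Ratio column.
table3_dynamic_mean = {
    "80286": 9.91, "80386": 4.91, "68000": 12.87, "68020N": 5.41,
    "68020C": 2.79, "32032": 10.26, "32100N": 9.00, "32100C": 4.18,
}
best = min(table3_dynamic_mean.values())
table3_dynamic_ratio = {
    name: round(mean / best, 2) for name, mean in table3_dynamic_mean.items()
}
```

In this sketch the arithmetic summary depends on the choice of reference machine while the geometric summary does not, which is the property Walker's letter relies on for normalized numbers.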


Cooper’s response 

Since our report in the August issue of IEEE Micro on 32-bit
microprocessor performance, our evaluations have continued. The National
32332 (follow-on to the 32032) has been received and the same benchmarks
run on it. The following are the numbers which can be added to the
tables found in the August article:


Table 1. 

32332 (15) 8.45 8.73 4.34 6.43 7.41 


Table 2. 

32332 (15) 6.92 6.26 2.42 3.79 6.15 


Table 3. 

32332 (15) 7.07 2.49 5.11 2.57 


Bus wars 


To the Editor: 

A word of thanks to Bob Stewart for his very kind reference to me in his
article in the August IEEE Micro. Also, Mildred and I were very amused
at Bob’s “...Bus Wars” cartoons; this is an accomplishment we’d never
suspected.

We still have very pleasant memories 
of the Oregon expedition, and especially 
of that final Saturday we spent together. 
We hope all is well with Bob and that 
there’ll be another opportunity to get 
together before we’re too old and grey. 

Matthew Taub 


To the Editor: 

The so-called “history” of microcomputer bus standards development
presented in cartoon form in the August issue (IEEE Micro,
“MicroStandards: Promises, promises, promises,” p. 66) grossly
misrepresents at least those events with which I am personally familiar.
This is particularly egregious in that I know author Stewart to be well
aware of the facts and the correct information was readily available to
your editors (see, e.g., the P896 PAR, IEEE Micro, Vol. 1, No. 1, p. 67;
or Wescon/81 Professional Program Session Record 27/5).

To cite a few specific examples: 

Cash Olsen had absolutely nothing to do with the formation of the P896
working group. Olsen was the chairman of a subgroup (the Futurebus
subcommittee) formed within the P696 project in mid-1978. This subgroup,
of which I was a member, was established to identify future
microprocessor bus requirements in general, and not at Olsen’s behest
(panel 10). In June 1979, frustrated by the lack of progress within this
subcommittee, I organized the study group on advanced microcomputer
system buses that prepared the project authorization request resulting
in P896. The P896 activity never met in Berkeley (panel 14). Note also
that the misleading use of the term Futurebus for the P896 activity
began at least three years after its inception.

The recognition of Taub’s development of the arbitration mechanism
(panel 11) was, in fact, quite belated. Gustavson presented the concept
to P896 without attribution and long before it became known that it had
originated with Taub. It is my impression that it was the publicity
surrounding its use in P896 that drew attention to the fact that it had
actually originated with Taub.

The serial intermodule communication concept (one of the things most
violently objected to by opponents of the early P896 drafts, but which
has since been incorporated into both the VMEbus and Multibus II)
originated with Rollie Linser, and the attribution to me in panel 15 is
quite simply false, as is the one in panel 17. Stewart did not attend
the Boulder meeting, but I specifically advised him of Linser’s
contribution when he telephoned me early this year seeking background
material for the original presentation of the material in question.
There was, of course, ample opportunity to verify the other information
at that time. Jean-Daniel Nicoud of EPF-Lausanne and the European
participants whose activities he coordinated until 1982 also made many
valuable contributions, including the low-cost, single-connector
approach since adopted by Nubus.

The end of Nicoud’s and my active participation in the P896 effort
(panel 19) is also misrepresented. In December 1981 (not 1980 as shown),
the P896 working group voted, 16 in favor and five (three of whom failed
to meet the requirement to state the modifications necessary to cause
them to change their vote) opposed, to take Draft 4.1.1 to the MSC with
a request to approve distribution for public comment. Contrary to the
impression given in Stewart’s comic strip, the rejection of this request
by the other members of the MSC at its January 14, 1982, meeting was not
unanimous, and Nicoud was not even present.

As was made clear by my statement at the time, a copy of which was
provided to the then chairman of the MSC but misrepresented in the
minutes of the meeting, my resignation was brought about not by the
denial of the working group’s request, but by the committee’s general
failure to prevent the holders of minority viewpoints from indefinitely
delaying progress. The result, as I predicted then, has been a failure
to achieve results commensurate with the enormous amount of effort that
has been devoted by working groups over the past nine years. In the case
of P896, for example, the MSC has manifestly failed in the objective
stated in the PAR, namely to provide an alternative to the development
of yet another generation of incompatible de facto bus standards,
namely, those previously mentioned plus VAXBI.

Judging by the quality of the reporting of events with which I am
familiar, the material in question may possibly belong in the funny
pages, but clearly has no place in a publication of the IEEE. Kindly
consider this a request for a public retraction.

Andrew Allison 
Los Altos Hills, CA 


Response from Robert G. Stewart 

Ko Ko:     Your Majesty, it’s like this: It is true that I stated that
           I had killed Nanki-Poo...
Mikado:    Yes, with most affecting particulars.
Pooh Bah:  Merely corroborative detail intended to give artistic
           verisimilitude to a bald and...

The Mikado,
W. S. Gilbert and Arthur Sullivan

Andrew Allison devoted hard and constructive efforts for years to the
standardization activities of the IEEE Computer Society. He was the
first MicroStandards editor of IEEE Micro. I sincerely apologize if he
felt affronted by the material in my August column, “A Historical? View
of IEEE Standards During the Great Bus Wars.”

His letter brings up several aspects of 
the 896 Futurebus project, some of 
which I agree with, and some which I 
don’t. Cash Olsen was dubious of the 
value of standardizing the S-100 MITS 
Altair bus (now IEEE 696). The talk of 
Tony Pietsch to the Microprocessor 
Standards Committee emphasized the 
need for more advanced buses. That led 
to the initiation of a study group with 
Cash as chair of the study group acting 
under the 696 PAR to look at other 
possible bus efforts. He looked at a 
“Home Bus” that never went very far 

continued on page 82 




Feature 


Guest Editor’s Introduction 

Digital Signal Processing 
Microprocessors: 
Forward to the Past? 

L. Robert Morris 

Carleton University and DSPS Inc., Ottawa 


In 1986, a number of new digital signal processing microchips were
announced by semiconductor manufacturers whose names are familiar to
IEEE Micro’s readers: Analog Devices, Motorola, National Semiconductor,
NEC, Texas Instruments, and Philips/Signetics. What are DSP micros? What
makes them different from the general-purpose micros offered by many of
the same companies? What are their applications? Why should IEEE Micro’s
readers be interested in such devices? This brief introduction and the
articles which follow will attempt to answer most of these questions.

DSP micros share one feature: speed in the execution of certain
algorithms. As was first noted in a 1983 survey [1], these processors
are, in effect, reduced-instruction-set computers optimized for the
fastest possible execution of addition, subtraction, multiplication, and
shifting instructions. In early DSP micros especially, a reduced
instruction set, which can be implemented in a small area of silicon, is
accompanied by single-cycle multiplication and shifting, which are
accomplished by devoting a relatively large area of silicon to an array
multiplier and a barrel (or combinatorial) shifter. In contrast, most
current general-purpose micros still effect such operations via
multiple-cycle, microcoded instructions that make use of the arithmetic
unit’s single-cycle, parallel-add and single-bit shift capability. Since
integer multiplication and shifting are statistically unimportant for
most programs that run on general-purpose micros, designers of such
devices prefer to devote large areas of silicon to implementation of
larger, more versatile instruction sets (sometimes including
floating-point in the on-chip microcode), memory management, or cache
memories.

The implication of the above (that fast integer multiplication and
shifting are considered crucial to digital signal processing software)
is correct. In fact, the software implementation of the most common
digital signal processing algorithm, an n-tap finite-length impulse
response (FIR) filter, essentially consists of n multiply/accumulates.
These instructions are executed once for every signal sample that is
input (at rates typically of 8 kHz and above). Most of the newer DSP
micros can accomplish each multiply/accumulate in a single cycle of
about 100 ns! This is one to three orders of magnitude faster than most
general-purpose micros. For example, a 16-MHz 80386, a state-of-the-art
micro which effects register-to-register 16-bit addition (ADD) in only
125 ns, requires about 1250 ns for a 16 x 16-bit multiplication (IMUL),
and a 5-MHz 8088 requires 32,000 ns for the same instruction! Other
important DSP algorithms (the fast Fourier transform, or FFT, for
example) require many more addition/subtractions than multiplications,
but even for these algorithms the relatively slow multiply on
general-purpose processors represents a significant bottleneck.
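The timing figures quoted above can be turned into a rough per-sample budget. The sketch below is only an illustration under stated assumptions: it charges each filter tap the quoted multiply time (plus the quoted ADD time for the 80386), ignores data movement and loop overhead, and assumes the lowest sample rate mentioned, 8 kHz. The function name `max_taps` is hypothetical.

```python
SAMPLE_RATE_HZ = 8_000
# Time available between consecutive input samples, in nanoseconds.
SAMPLE_PERIOD_NS = 1_000_000_000 // SAMPLE_RATE_HZ  # 125,000 ns

def max_taps(ns_per_tap):
    """Upper bound on FIR taps computable within one sample period."""
    return SAMPLE_PERIOD_NS // ns_per_tap

# DSP micro: one multiply/accumulate per ~100 ns cycle.
dsp_taps = max_taps(100)

# 16-MHz 80386: ~1250 ns IMUL plus 125 ns ADD per tap (assumed cost model).
i386_taps = max_taps(1250 + 125)

# 5-MHz 8088: 32,000 ns for the multiply alone, so this is optimistic.
i8088_taps = max_taps(32_000)

print(dsp_taps, i386_taps, i8088_taps)  # prints: 1250 90 3
```

Under these assumptions a DSP micro has room for a filter of over a thousand taps at 8 kHz, while the 8088 can barely manage three, which is consistent with the one-to-three-orders-of-magnitude gap described in the text.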

The first DSP micro, the Intel 2920, appeared nearly a decade ago. It
was followed by the AMD 2811, the NEC µPD7720, and, in 1982, the Texas
Instruments TMS32010. While the 2811 and 7720 both had on-chip array
multipliers, both were ROM-programmable only and had relatively small
data and program address spaces. The 32010 was the first DSP micro that
could execute instructions at full speed from an off-chip program RAM,
and it could also accommodate a program nearly an order of magnitude
larger than the 7720 could.

Table 1. Complex FFT performance (times in ms).

Fixed-point
                                   Number of points
Processor                      64       256       1024
1976 DSP
  SPS61                       0.21      1.10       4.89
  SPS 21                      1.08      2.52       8.13
1976 general-purpose
  PDP-11/55                   4.22     20.00     118.00
1986 DSP
  DSP56000 (20 MHz)           0.140     0.71       4.99
  TMS320C25 (40 MHz)          0.217     1.22       7.10
  ADSP2100 (32 MHz)           0.319     1.52       7.19
  TMS32020 (20 MHz)           0.434     2.44      14.18
  LM32900 (20 MHz)            0.550      -        13.40
  5010/11 (8 MHz)              -        3.30      33.00
  TMS32010 (20 MHz)           0.535     6.30      30.00
1986 general-purpose
  80386 (16 MHz)              1.667     9.50      50.00
  80286 (8 MHz)               3.800    20.00     110.00
  8086 (8 MHz)               11.000    55.00     310.00
  8088 (5 MHz)               19.000   110.00     610.00

Floating-point
Processor                     1024 points
1976 DSP
  AP-120B                        4.75
  MAP-100                       60.00
1976 general-purpose
  PDP-11/55 with FP-11C        168.00
1986 DSP
  µPD77230 (15 MHz)             12.50
1986 general-purpose
  80386/387 (16 MHz)           100.00 (est.)
  8088/87 (5 MHz)              995.00

Benchmark sources:

• DSP56000, µPD77230, ADSP2100: IEEE Micro, Dec. 1986,

• LM32900, 5010: information from manufacturers,

• PDP-11/55, TMS32010, 80386, 80286, 8088, 8086, 8087, 80387: DSPS Inc.,

• TMS32020: TI’s Details on Signal Processing, Nov. 1985, and

• TMS320C25: extrapolated from TMS32020.

Smaller FFTs are often proportionally more efficient than larger ones because

• in-line code can often be used for n = 64 points or less, and

• smaller FFTs contain a larger fraction of more efficient, faster, zero-multiply butterflies.

The FFT is a representative and realistic DSP benchmark since it
contains a mixture of multiplications and additions and its
computational kernels (the butterflies) are of nontrivial program
complexity compared to the single instruction required for FIR filters
on many DSP micros.


The articles in this issue will reveal both similarities and differences
between DSP and general-purpose micros. For example, DSP micros employ
many speed- and efficiency-related design strategies also employed in
regular micros: pipelining of instructions, use of addressing modes that
efficiently access relevant data structures (e.g., autoincrement and
autodecrement modes for arrays and an indexed addressing mode for FFTs),
and use of “clean” subroutine calling and address passing protocols.
Differences include DSP micros’ use of the dual-bus Harvard
architecture, which enables simultaneous fetching of instructions and
data; special DSP-related addressing modes (e.g., index computation
modulo an arbitrary number, automatic circular queue or free data move
for FIR filters, and bit reversal for FFTs); extra addressing ALUs; and
special interfaces to serve specific fields of application (e.g., serial
interfaces for codecs in telecommunications).
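Two of the DSP-specific addressing modes just mentioned, modulo (circular) indexing and bit reversal, are easy to picture in software. The sketch below is a plain illustration of what such hardware computes; the names `bit_reverse`, `bit_reversed_order`, and `CircularDelayLine` are hypothetical and do not correspond to any particular chip's instruction set.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

def bit_reversed_order(n):
    """Index order used to reorder data for a radix-2 FFT.

    Assumes n is a power of two.
    """
    num_bits = n.bit_length() - 1
    return [bit_reverse(i, num_bits) for i in range(n)]

class CircularDelayLine:
    """Fixed-length sample history addressed modulo its length."""

    def __init__(self, length):
        self.samples = [0.0] * length
        self.newest = 0

    def push(self, x):
        # Advance the write pointer with wraparound; the oldest sample
        # is overwritten, so no data has to be moved.
        self.newest = (self.newest + 1) % len(self.samples)
        self.samples[self.newest] = x

    def tap(self, delay):
        # delay = 0 is the newest sample, delay = 1 the one before, etc.
        return self.samples[(self.newest - delay) % len(self.samples)]

order = bit_reversed_order(8)   # [0, 4, 2, 6, 1, 5, 3, 7]

line = CircularDelayLine(4)
for sample in [1.0, 2.0, 3.0, 4.0, 5.0]:
    line.push(sample)
```

On a general-purpose micro each of these index computations costs extra instructions per access; a DSP micro with dedicated addressing ALUs performs the equivalent in parallel with the arithmetic.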

How were DSP algorithms implemented in the pre-DSP micro era? In the
early 1970’s, array processors (first fixed-point and then
floating-point) were available for real-time execution of many
audio-bandwidth DSP algorithms. These machines varied in cost from
$10,000 to $50,000 and typically consisted of a rack-mounted unit
weighing upwards of 100 lbs. and consuming about a kilowatt of power.
These attributes generally limited the use of array processors to large
laboratories and certainly precluded the inclusion of such machines as
subcomponents in OEM systems. The data in Table 1 reveal an interesting
fact: the 1976 array processors and the 1986 DSP micros have comparable
FFT execution times, and the same holds for 1976 general-purpose
minicomputers and 1986 general-purpose micros! Thus, today’s DSP and
general-purpose micros exhibit approximately the same performance as
their decade-old ancestors. Further, comparisons of their architectures
do not reveal that any startling changes have occurred since 1976.

What has occurred, of course, is a three-order-of-magnitude reduction in
cost, size, weight, and power consumption. It is the combination of 1976
array processor performance with 1986 microchip attributes that has both
quantitatively and qualitatively changed the extent to which theory can
be applied to the practical solution of problems in signal processing,
communications, and control, and in new disciplines such as artificial
intelligence. In many cases, the computer simulation traditionally
carried out as a precursor to system realization via hardwired logic can
now become the cost-effective implementation via software on a DSP
micro.

DSP micros are now on the verge of surpassing their array processor
ancestors in architectural complexity and sophistication as well as in
performance. Thus, the theme finally becomes “Forward to the Future.”
VLSI allows active device densities and signal propagation times not
possible a decade ago. And, fortunately, semiconductor technologies have
not yet hit a “brick wall” in terms of speed. Gallium arsenide (GaAs)
transistors and high-electron-mobility transistors (HEMTs) in particular
suggest that another “easy” order-of-magnitude improvement in
performance is not unreasonable to anticipate, even with existing
architectures. Although DSP devices having parallel and dataflow
architectures have appeared, at present they have not achieved the user
acceptance of more conventional “sequential” processors. This is
partially due to the fact that the present DSP micro user anticipates
that performance enhancements requiring neither changes to algorithms
nor even changes to software will continue to appear due to
clock-speed-related semiconductor progress alone!

We should discuss one other possible scenario [1]. Note that the fastest
general-purpose micros already approach the performance of the slowest
DSP micros: a 16-MHz 80386 computes a 1K, complex, fixed-point FFT only
66 percent slower than a 20-MHz TMS32010 (see Table 1 again). With the
newest versions of general-purpose micros already incorporating a
DSP-like dual-bus architecture (for example, the Motorola 68030 [2]),
the obvious next step, integration of an array multiplier and a barrel
shifter into general-purpose micros, cannot be far off. Since these two
devices make possible fast floating-point multiplication and addition,
respectively, and since floating-point performance “sells,” most
semiconductor manufacturers are on the verge of taking this step.

Although the resulting general-purpose micros will still lack some
special instructions and architectural attributes that help in achieving
maximum DSP performance, it is entirely conceivable, with GaAs
technology already commercially viable in 1986 [3,4], that by
incorporating GaAs/HEMT transistors they can achieve a performance of
100 MIPS and upwards and make special-purpose DSP micros unnecessary in
many DSP applications.
The Texas Instruments TMS320C25 Digital Signal Microcomputer

Capable of 10 million operations per second, the newest member of the
TMS320 family can serve as an inexpensive alternative to bit-slice
processors or custom ICs in digital signal processing applications.


Digital signal processing encompasses a variety of applications,
including digital filtering, speech vocoding, image processing, fast
Fourier transforms, and digital audio [1-5]. All DSP applications have
several characteristics in common. First, they employ algorithms that
are mathematically intensive. An example is the finite-duration impulse
response, or FIR, filter, which in the time domain takes the form

$$ y(n) = \sum_{i=1}^{N} a(i) \cdot x(n-i), \tag{1} $$

where y(n) is the output sample at time n, a(i) is the ith coefficient
or weighting factor, and x(n - i) is the (n - i)th input sample. From
this equation, we can see that the FIR filter contains an abundance of
multiplications and additions (that is, sums of products). This equation
is the general form of an FIR filter [6] as well as the convolution of
two sequences of numbers a(i) and x(i) [7]. Both operations are
fundamental to digital signal processing.
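As a minimal sketch of Equation 1, the function below computes one output sample as a sum of N products. It is an illustration only: the name `fir_output` is hypothetical, coefficients are stored so that `a[0]` holds a(1), and input samples before the start of the record are assumed to be zero.

```python
def fir_output(a, x, n):
    """Return y(n) = sum over i = 1..N of a(i) * x(n - i).

    a: list of N coefficients, with a[0] holding a(1)
    x: list of input samples, with x[k] holding x(k)
    n: time index of the output sample
    """
    acc = 0.0
    for i in range(1, len(a) + 1):
        k = n - i
        if 0 <= k < len(x):            # samples outside the record count as 0
            acc += a[i - 1] * x[k]     # one multiply/accumulate per tap
    return acc

# Two-tap averaging filter applied to a short ramp.
coeffs = [0.5, 0.5]
signal = [2.0, 4.0, 6.0]
y2 = fir_output(coeffs, signal, 2)   # 0.5*x(1) + 0.5*x(0)
y3 = fir_output(coeffs, signal, 3)   # 0.5*x(2) + 0.5*x(1)
```

Each pass through the loop body is one multiply and one add, which is why the cost of the filter is dominated by how quickly the processor can perform that pair of operations.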

Second, DSP algorithms must be performed in real time; i.e., they must
not produce a delay noticeable to the user. In a speech recognition
system, for example, the algorithms must not produce a noticeable delay
between a word being spoken and that word being recognized. In an image
processing system, processing needs to be completed within a frame
update period.

Third, all DSP applications involve the sampling of a signal. Referring
to Equation 1, we can see that the output y(n) is calculated to be the
weighted sum of the previous N inputs. In other words, the input signal
is sampled at periodic intervals, and the samples are multiplied by a
weighting factor a(i) and then added together to give the output result
y(n). In a typical DSP application, the processor must be able to
perform arithmetic computations and effectively handle sampled data in
large quantities.

Last, DSP systems must be flexible enough to incorporate 
improvements in the state of the art. Many DSP techniques 
are still developing, and therefore their algorithms tend to 
change. Speech recognition, for example, is presently an inexact
technique still undergoing algorithmic modification.
This implies that DSP systems need to be programmable so 
that they can easily accommodate revised algorithms. 

Over the past several decades, digital signal processing
machines have taken several forms in response to application
need and available technology. Array processors have
long been the accepted solution for the research laboratory
and have been extended to end applications in some instances.
However, as integrated circuit technology has
matured, digital signal processing has migrated from the array
processor to the bit-slice processor to the single-chip
processor. This has brought the cost of DSP solutions down
to a point that allows pervasive use of the technology.

The members of the TMS320 family of devices are examples
of the single-chip digital signal processor. The first
member of the family, the TMS32010, was introduced to
the market in 1983 [8,9]. It can perform five million DSP
operations per second, including the add and multiply functions [10]
required in Equation 1. The newest member of the
family, the TMS320C25, can perform 10 million DSP
operations per second [11], and it combines the multiply/accumulate
functions into one single-cycle operation.


Basic TMS320 architecture 

The fundamental attribute of a digital signal processor is 
fast arithmetic operations. The members of the TMS320
family [10-12], like many other digital signal processors,
achieve fast arithmetic operations by employing

• a Harvard architecture, 

• a dedicated hardware multiplier, 

• special DSP instructions, and 

• extensive pipelining. 

Use of these concepts allows a digital signal processor to
handle a vast amount of data and execute most DSP operations
in a single-cycle instruction.

The TMS320 family utilizes a modified Harvard architecture
for speed and flexibility. In a strict Harvard architecture [13,14],
the program memory and data memory lie in two
separate spaces, permitting a full overlap of the instruction
fetch and execution. The TMS320 family’s modification of
the Harvard architecture allows transfers between the program
space and data space, thereby increasing the flexibility
of the devices in the family. This architectural modification
eliminates the need for a separate coefficient ROM and also
maximizes processing power by maintaining two separate
bus structures (program and data) for full-speed execution.

The TMS320 family’s dedicated hardware multiplier employs
a 16 × 16-bit organization, which yields a 32-bit
result and allows multiplication to take place in a single
cycle. The special DSP instructions include DMOV (data
move) and RPT (repeat), which speed up DSP operations.
The extensive pipelining ensures maximum throughput for
real-time applications.


The TMS320C25 architecture 

The TMS320C25 digital signal processor is a microcomputer
with a 32-bit internal Harvard architecture and
a 16-bit external interface. It is a pin-compatible CMOS
version of the TMS32020 microprocessor but has an instruction
execution rate twice as fast and includes additional
hardware and software features. The TMS320C25’s
instruction set is a superset of that of the TMS32010 and
that of the TMS32020, and it maintains source-code compatibility
with them. In addition, it is completely object-code-compatible
with the TMS32020 so that TMS32020 programs
can run unmodified on the TMS320C25. Some of the major
features of the TMS320C25 are

• a 32-bit ALU and accumulator,

• an instruction cycle time of 100 ns,

• a single-cycle multiply/accumulate,

• use of low-power CMOS technology with a power-down mode,

• 4K 16-bit words of on-chip masked ROM,

• 544 words of on-chip data RAM,

• 128K words of data/program memory space,

• eight auxiliary registers with a dedicated arithmetic unit,

• an eight-level hardware stack,

• a fully static double-buffered serial port,

• concurrent DMA that uses an extended hold operation,

• bit-reversed addressing modes for fast Fourier transforms,

• extended-precision arithmetic and adaptive filtering
support,

• full-speed operation of data move instructions from external
memory,

• an accumulator carry bit and related instructions, and

• fabrication in 1.8-µm CMOS and packaging in a 68-pin
PLCC.
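
To illustrate why bit-reversed addressing appears in the list above: an in-place radix-2 FFT reads or writes its data in bit-reversed index order, so a processor that can generate that order in its address unit avoids a separate reordering pass. The Python sketch below only shows the index sequence involved; the function name `bit_reverse` is hypothetical, and the loop is not a description of how the TMS320C25 hardware forms the addresses.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the low bit of index into result
        index >>= 1
    return result


# Access order for an 8-point transform (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
```

For an 8-point transform the resulting order is 0, 4, 2, 6, 1, 5, 3, 7.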

The 100-ns instruction cycle time provides a significant 
throughput advantage for many applications. Since most of 
the TMS320C25’s instructions can execute in a single cycle, 
it can execute 10 million instructions per second. Most of 
the other features listed above also contribute to the 
TMS320C25’s high throughput. 

The TMS320C25 includes instructions to perform the
data transfers between program space and data space
discussed earlier. Externally, the program and data memory
spaces are multiplexed over the same bus so as to maximize 
the address range for both spaces and minimize the pin 
count of the device. Internally, the TMS320C25 architecture 
maximizes processing power by maintaining two separate 
bus structures, program and data, for full-speed execution. 

Program execution in the device takes the form of a 
three-level instruction fetch-decode-execute pipeline. This 
pipeline is invisible to the user except in cases in which it 
must be broken, such as for branch instructions. In this 
case, the instruction timing takes into account the fact that 
the pipeline must be emptied and refilled. 

Two large, on-chip data RAM blocks (a total of 544
words), one of which is configurable either as program or
data memory, are provided. An off-chip, 64K-word,
directly addressable data memory address space is included
to facilitate implementations of DSP algorithms with large
data memory requirements. Four-K words of on-chip program
ROM and 64K words of off-chip program address
space are available. Large programs can execute at full
speed from this memory space. Programs can also be
downloaded from slow external memory to on-chip RAM
for full-speed operation.

Figure 1. TMS320C25 block diagram.

LEGEND:
ACCH     Accumulator high
ACCL     Accumulator low
ALU      Arithmetic logic unit
ARAU     Auxiliary register arithmetic unit
ARB      Auxiliary register pointer buffer
ARP      Auxiliary register pointer
DP       Data memory page pointer
DRR      Serial port data receive register
DXR      Serial port data transmit register
IFR      Interrupt flag register
IMR      Interrupt mask register
IR       Instruction register
MCS      Microcall stack
QIR      Queue instruction register
PR       Product register
PRD      Period register for timer
TIM      Timer
TR       Temporary register
PC       Program counter
PFC      Prefetch counter
RPTC     Repeat instruction counter
GREG     Global memory allocation register
RSR      Serial port receive shift register
XSR      Serial port transmit shift register
AR0-AR7  Auxiliary registers
ST0, ST1 Status registers

Figure 2. TMS320C25 memory maps.
*Block B0 is addressed as program memory after a CNFP instruction, and as data memory after a CNFD instruction.

The TMS320C25 also incorporates a hardware timer and 
a block data transfer capability. 

The diagram of the TMS320C25 in Figure 1 shows the 
principal blocks and data paths within the processor. It also 
shows all of the TMS320C25’s interface pins. 

The TMS320C25’s architecture is built around the program
and data buses. The program bus carries the instruction
code and immediate operands from program memory.
The data bus interconnects elements such as the central
arithmetic logic unit (CALU) and the auxiliary register file
to the data RAM. Together, the program and data buses can
carry data from on-chip data RAM and internal or external
program memory to the multiplier in a single cycle for
multiply/accumulate operations.

A high degree of parallelism exists in the device—for 
example, while data are being operated on by the CALU, 
arithmetic operations can be implemented in the auxiliary 
register arithmetic unit (ARAU). Such parallelism results in 
a powerful set of arithmetic, logical, and bit-manipulation 
operations that can be performed in a single machine cycle. 

Memory allocation. As mentioned above, the TMS320C25
provides 4K 16-bit words of on-chip program ROM and 544
16-bit words of on-chip data RAM. The RAM is divided
into three blocks, B0, B1, and B2. Of the 544 words, 256
words (block B0) are configurable as either data memory or
program memory; 288 words (blocks B1 and B2) are always
data memory. A data memory size of 544 words allows the
TMS320C25 to handle a data array of 512 words but still
leaves 32 locations for intermediate storage.

The TMS320C25 maintains separate address spaces for
program memory, data memory, and I/O. In addition to
blocks B0, B1, and B2, the on-chip data memory map (see
Figure 2) includes memory-mapped registers. Six peripheral
registers, the serial-port registers (DRR and DXR), timer
register (TIM), period register (PRD), interrupt mask
register (IMR), and global memory allocation register
(GREG), have been mapped into the data memory space so
they can be easily modified.

The TMS320C25 has a register file containing eight auxiliary
registers that can be used for indirect addressing of
data memory or for temporary storage. These registers, 
AR0-AR7, can be either directly addressed by an instruction 
or indirectly addressed by a three-bit auxiliary register 
pointer (ARP). The auxiliary registers and the ARP can be 
loaded either from data memory or by an immediate 
operand defined in the instruction. The contents of the 
registers can also be stored in data memory. 
Figure 3. Auxiliary register file.



The auxiliary register file is connected to the auxiliary
register arithmetic unit as shown in Figure 3. The ARAU
can autoindex the current auxiliary register while the data
memory location is being addressed. The current auxiliary
register can also be indexed either by +1/−1 or by the
contents of AR0. As a result, the accessing of tables of
information does not require the CALU for address manipulation,
thereby freeing it for other operations.

Although the ARAU was designed to support address
manipulation in parallel with other operations, it can also
serve as an additional general-purpose arithmetic unit since
the auxiliary register file can communicate directly with data
memory. The ARAU implements 16-bit unsigned arithmetic,
whereas the CALU implements 32-bit two’s-complement
arithmetic. The ARAU also provides branches dependent
on the comparison of AR0 to the auxiliary register
pointed to by the ARP.

Central arithmetic logic unit. The CALU contains a
16-bit scaling shifter, a 16 × 16-bit parallel multiplier, a
32-bit ALU, and a 32-bit accumulator. The scaling shifter
has a 16-bit input connected to the data bus and a 32-bit
output connected to the ALU. This shifter produces a left
shift of 0 to 16 bits on the input data, as programmed in the
instruction. The least significant bits of the output are filled
with zeroes, and the most significant bits are either filled
with zeroes or sign-extended, depending upon the state of
the sign-extension mode bit of status register ST1. Additional
shifters at the outputs of both the accumulator and
the multiplier are suitable for numerical scaling, bit extraction,
extended-precision arithmetic, and overflow prevention.
Due to the pipelining in the TMS320C25, shifting is
accomplished as part of an instruction and thus does not require
additional cycles for execution.
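
The behavior of the input scaling shifter described above can be modeled roughly as follows. This is a sketch under stated assumptions, not a description of the actual circuit: the function name `scaling_shifter` and its boolean `sign_extend` argument (standing in for the sign-extension mode bit) are invented for the illustration.

```python
def scaling_shifter(word16, shift, sign_extend):
    """Model a 16-bit input left-shifted by 0..16 bits into a 32-bit output.

    The low bits of the result are zero-filled; the high bits are either
    zero-filled or sign-extended, depending on `sign_extend`.
    """
    if not 0 <= shift <= 16:
        raise ValueError("shift must be between 0 and 16")
    word16 &= 0xFFFF
    if sign_extend and (word16 & 0x8000):
        value = word16 - 0x10000      # interpret as a negative two's-complement value
    else:
        value = word16
    return (value << shift) & 0xFFFFFFFF   # keep the 32 bits seen by the ALU
```

For example, the input 0x8000 with no shift becomes 0xFFFF8000 when sign extension is on and 0x00008000 when it is off.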

The 32-bit ALU and accumulator perform a wide range
of arithmetic and logical instructions. An overflow saturation
mode permits the accumulator to be loaded with the
most positive or negative number (the choice depending on
the direction of overflow), and it allows an overflow flag to
be set whenever an overflow occurs. One of the two inputs
to the ALU is always provided from the accumulator, and
the other may be transferred from the product register (PR)
of the multiplier or from the scaling shifter loaded from
data memory.
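
A rough Python model of the overflow saturation mode is given below. The function name `alu_add` and the `saturate` flag are hypothetical, and the wrap-around behavior shown for the non-saturating case is an assumption based on ordinary two's-complement arithmetic rather than something stated in the text.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -2**31


def alu_add(acc, operand, saturate):
    """Add two signed 32-bit values; return (result, overflow_flag)."""
    total = acc + operand
    overflow = total > INT32_MAX or total < INT32_MIN
    if not overflow:
        return total, False
    if saturate:
        # Load the most positive or most negative value, by overflow direction.
        return (INT32_MAX if total > 0 else INT32_MIN), True
    # Assumed behavior without saturation: two's-complement wrap-around.
    wrapped = (total + 2**31) % 2**32 - 2**31
    return wrapped, True
```

With saturation enabled, a filter output that overflows clips at full scale instead of wrapping to the opposite sign, which is usually the less damaging error in a signal path.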

The implementation of a typical ALU instruction requires 
these steps: 

• data are fetched from the RAM on the data bus;

• data are passed through the scaling shifter and through 
the ALU, where the arithmetic is performed; and 

• the result is moved into the accumulator. 

The 32-bit accumulator is split into two 16-bit segments
for storage in data memory: ACCH (accumulator high) and
ACCL (accumulator low). Shifters at the output of the accumulator
provide a shift of 0 to 7 places to the left. This
shift is performed while the data are being transferred to the
data bus for storage. The contents of the accumulator remain
unchanged. The accumulator also has an in-place one-bit
shift to the left or right (SFL or SFR instruction) and a
rotate through carry (ROL or ROR instruction) for shifting
its contents.

A carry bit is provided to the accumulator, allowing more
efficient extended-precision computation. ADDC (add with
carry) and SUBB (subtract with borrow) are two instructions
using the carry bit. Branch instructions that use the
carry bit are also provided.
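
The role of the carry bit in extended-precision work can be sketched as a 64-bit addition built from two 32-bit additions. The helper names `add_with_carry` and `add64` are hypothetical; the sketch models only the arithmetic idea behind an add-with-carry instruction, not the TMS320C25 instruction sequence.

```python
MASK32 = 0xFFFFFFFF


def add_with_carry(a, b, carry_in=0):
    """Add two unsigned 32-bit words plus a carry; return (sum32, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64(a_hi, a_lo, b_hi, b_lo):
    """64-bit addition from two 32-bit additions linked by the carry."""
    lo, carry = add_with_carry(a_lo, b_lo)          # low words: produces the carry
    hi, carry = add_with_carry(a_hi, b_hi, carry)   # high words: consumes the carry
    return hi, lo, carry
```

Without a carry bit, the program would have to detect the low-word overflow with extra compare and branch instructions before adding the high words.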

Hardware multiplier. The TMS320C25 uses a 16 × 16-bit
hardware multiplier that can compute a 32-bit product during
every machine cycle. Two registers are associated with
the multiplier: a 16-bit temporary register (TR) that holds
one of the operands for the multiplier, and a 32-bit product
register (PR) that holds the product.

The output of the product register can be left-shifted one
or four bits. This is useful for implementing fractional
arithmetic or justifying fractional products. The output of
the PR can also be right-shifted six bits to enable the execution
of up to 128 consecutive multiply/accumulates without
overflow.

The multiplier performs both signed and unsigned operations.
Two signed instructions, MAC (multiply/accumulate)
and MACD (multiply/accumulate and data move), can process
both operands simultaneously, thereby fully utilizing
the computational bandwidth of the multiplier. For MAC
and MACD, the two operands are transferred to the multiplier
at each cycle via the program and data buses. This
enables MAC and MACD to be performed in a single cycle
when they are used with repeat (RPT or RPTK) instructions.
The program bus can supply data from internal or external
memory (RAM or ROM) and still maintain single-cycle
operation. An unsigned multiply (MPYU) instruction
facilitates extended-precision multiplication. It multiplies
the unsigned contents of the TR by the unsigned contents of
the addressed data memory location, and places the result in
the PR.
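
The way a repeated multiply/accumulate with data move produces one FIR output while also shifting the delay line can be sketched in Python. This is a simplified model under several assumptions: it ignores the 16-bit word width, the product register's output shifter, and the instruction pipeline, and the function name `fir_macd` and the list layout are invented for the example.

```python
def fir_macd(coeffs, delay_line, new_sample):
    """Compute one FIR output with a MACD-style loop.

    coeffs[i] multiplies delay_line[i]; delay_line[0] is the newest sample.
    delay_line must have len(coeffs) + 1 entries so that the oldest sample
    has a slot to move into (mirroring the move to the next address).
    """
    n = len(coeffs)
    delay_line[0] = new_sample
    acc = 0
    preg = 0
    # Work from the oldest sample to the newest so that each data move
    # overwrites a slot whose value has already been used.
    for i in range(n - 1, -1, -1):
        acc += preg                           # accumulate the previous product
        preg = coeffs[i] * delay_line[i]      # form the next product
        delay_line[i + 1] = delay_line[i]     # data move to the next address
    acc += preg                               # add the final product
    return acc


line = [0, 0, 0, 0]
outputs = [fir_macd([1, 2, 3], line, s) for s in (4, 5, 6)]
```

Because the accumulate step always adds the product formed on the previous pass, one extra accumulate is needed after the loop to pick up the last product.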

Control operations. Control operations are provided on 
the TMS320C25 by an on-chip timer, a repeat counter, three 
external maskable user interrupts, and internal interrupts 
generated by serial-port operations or by the timer. 

A memory-mapped 16-bit timer (TIM) register (a down
counter) is continuously clocked by CLKOUT1. A timer interrupt
(TINT) is generated whenever the timer decrements
to zero. The timer is reloaded with the value contained in
the period (PRD) register within the first cycle after it
reaches zero so that interrupts may be programmed to occur
at regular intervals of (PRD + 1) × CLKOUT1 cycles. This
feature is useful for control operations and for synchronous
sampling of or writing to peripherals.
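
As a worked example of the (PRD + 1) relationship, the sketch below computes the period register value for a desired interrupt rate. It assumes that one CLKOUT1 cycle lasts 100 ns, the same as the instruction cycle time; the text does not state this explicitly, and the function name `period_register_value` is hypothetical.

```python
def period_register_value(interrupt_period_s, clkout1_period_s=100e-9):
    """Return the PRD value giving one interrupt every (PRD + 1) CLKOUT1 cycles.

    clkout1_period_s defaults to 100 ns, which is an assumption of this sketch.
    """
    cycles = round(interrupt_period_s / clkout1_period_s)
    if not 1 <= cycles <= 65536:
        raise ValueError("period not reachable with a 16-bit timer")
    return cycles - 1


# Sampling at 8 kHz: 125 microseconds is 1250 cycles of 100 ns.
prd_8khz = period_register_value(1 / 8000)
```

Under that assumption, an 8-kHz sampling interrupt corresponds to a PRD value of 1249.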

The repeat counter (RPTC) is loaded with either a data
memory value (in the case of the RPT instruction) or an immediate
value (in the case of the RPTK instruction). The
repeat feature enables a single instruction to be executed up
to 256 times. It can be used with instructions such as
multiply/accumulates, block moves, I/O transfers, and table
read/writes. Those instructions that are normally multicycle
are pipelined when the repeat feature is used and effectively
become single-cycle instructions. For example, the table
read (TBLR) instruction ordinarily takes three or more
cycles, but when it is repeated, it becomes a single-cycle
instruction.

The three external maskable user interrupts, INT2 to
INT0, enable external devices to interrupt the processor.
Internal interrupts are generated by either the serial port,
the timer, or the software interrupt instruction. Interrupts
are prioritized, with reset having the highest priority and the
serial-port transmit interrupt the lowest.

Serial port. An on-chip serial port provides direct communication
with serial devices such as codecs and serial
A/D and D/A converters. The serial port’s interface requires
a minimum of external hardware. The port has two
memory-mapped registers, a data transmit register and a
data receive register, which can be operated in either an
eight-bit byte mode or a 16-bit word mode. The transmit
framing sync pulse can be generated internally or externally.
The serial port’s maximum speed is 5 MHz.

The primary enhancements of the TMS320C25’s serial 
port are 

• double buffering for both receive and transmit operations,

• the elimination of a minimum CLKR/CLKX frequency 
(fmin = 0 Hz), and 

• the provision of a frame sync mode (FSM) bit, which 
allows continuous operation with no frame sync pulses. 

The FSM is useful for communicating on pulse-code-modulated
(PCM) telephone system highways. As a result the TMS320C25
can communicate directly on PCM highways such
as AT&T T-1 and CCITT G.711/712 by counting the transmitted
and received bytes in software and performing the
instructions needed to set (SFSM) and reset (RFSM) the
FSM bit.

I/O interface. The TMS320C25’s I/O space consists of 16
input and 16 output ports. These ports provide a full 16-bit
parallel I/O interface via the processor’s data bus. A single
input (IN) or output (OUT) operation typically takes two
cycles; however, when executed in the repeat mode, such an
operation becomes single-cycle. The TMS320C25 supports a
range of system interfacing requirements. As previously
mentioned, three separate address spaces (program, data,
and I/O) provide interfacing to memory and I/O, thereby
maximizing system throughput. The TMS320C25 simplifies
I/O design by treating I/O the same way it treats memory.
It maps I/O devices into the I/O address space using its external
address and data buses in the same way as it uses
them for mapping memory devices into memory address
space.

The local memory interface consists of a 16-bit parallel
data bus (D15-D0), a 16-bit address bus (A15-A0), three
pins for data memory, program memory, and I/O space
select (DS, PS, and IS, respectively), and various system
control signals. The R/W signal controls the direction of a
data transfer, and STRB provides a timing signal to control
the transfer. When using on-chip program RAM, ROM, or
high-speed external program memory, the TMS320C25 runs
at full speed without wait states. By using the READY
signal, it can generate wait states so it can communicate
with slower off-chip memories.

The TMS320C25 supports direct memory access to external
program and data memory. Another processor can take
complete control of the TMS320C25’s external memory by
asserting HOLD low, causing the TMS320C25 to place its
address, data, and control lines in the high-impedance state.
Two modes are available on the device. In the first mode,
execution is suspended during assertion of HOLD. In the
second mode, the “concurrent DMA mode,” the TMS320C25
continues to execute its program while operating
from internal RAM or ROM, thereby greatly increasing
throughput in data-intensive applications. Signaling between
the external processor and the TMS320C25 can be
performed through interrupts.
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Table 1. 

TMS320C25 instructions. 



ACCUMULATOR MEMORY REFERENCE INSTRUCTIONS 

MNEMONIC 

DESCRIPTION 

NO. 

WORDS 

OPERATION 

ABS 

Absolute value of accumulator 

1 

(ACC) - ACC 

ADO 

Add to accumulator with shift 

1 

(ACC) + |(dma) x 2 sh,ft | ACC 

ADDC* 

Add to accumulator with carry 

1 

(ACC) + (dma) + (C) - ACC 

ADDH 

Add to high accumulator 

1 

(ACC) + [(dma) x 2 16 | - ACC 

ADDK* 

Add to accumulator short immediate 

1 

(ACC) + 8 bit constant - ACC 

ADDS 

Add to low accumulator with sign 

extension suppressed 

1 

(ACC) + (dma) - ACC 

ADDT t 

Add to accumulator with shift specified by 

T register 

1 

(ACC) + l(dma) x 2 ,Tre a>) -> ACC 

ADLK* 

Add to accumulator long immediate with shift 

2 

(ACC) + 116.bit constant x 2 shl,, | -* ACC 

AND 

AND with accumulator 

1 

(ACC( 1 5-0)).AND.(dma) - ACCI15 0). 

0 • ACCI31-16) 

ANDK T 

AND immediate with accumulator with shift 

2 

(ACC130 0)).AND.i 16-bit constant x 2 shlft | • 
ACCI30 0). 0 - ACCOO-O) 

CMPLt 

Complement accumulator 

1 

(ACC) - ACC 

LAC 

Load accumulator with shift 

1 

(dma) x 2 sh,f ’ - ACC 

LACK 

Load accumulator immediate short 

1 

8 bit constant -♦ ACC 

LACT 

Load accumulator with shift specified by T register 

1 

(dma) x 2< Tre 9> - ACC 

LALK T 

Load accumulator long immediate with shift 

2 

(16 bit constant) x 2 16 =-* ACC 

NEGt 

Negate accumulator 

1 

(ACC) • ACC 

NORM t 

Normalize contents of accumulator 

1 


OR 

OR with accumulator 

1 

(ACC(1 5-01).OR. (dma) - ACCI15-0) 

ORK t 

OR immediate with accumulator with shift 

2 

(ACCI30-0)).OR.116-bit constant x 2 shlft | - 
ACCI30 0) 

ROL* 

Rotate accumulator left 

1 

(ACCOO-O)) - ACCOM), (C) * ACCIO), 

(ACCI31)) • C 

ROR* 

Rotate accumulator right 

1 

(ACCOM)) - ACCOO-O), (C) - ACCI31), 

(ACC <0!) - C 

SACH 

Store high accumulator with shift 

1 

| (ACC) x 2 shi,t 1 • dma 

SACL 

Store low accumulator with shift 

1 

[(ACCL) x 2 shl,t ] • dma 

SBLK t 

Subtract from accumulator long immediate with shift 

2 

(ACC) (16 bit constant x 2 sh, * t l ACC 

SFLt 

Shift accumulator left 

1 

(ACCOO-O)) * ACCOM), 0 -» ACCIO) 

SFR* 

Shift accumulator right 

1 

(ACCOM)) - ACCOO-O), (ACC(31)) - ACCI31) 

SUB 

Subtract from accumulator with shift 

1 

(ACC) [(dma) x 2 shlfl | - ACC 

SUBB* 

Subtract from accumulator with borrow 

1 

(ACC) (dma) (C) ACC 

SUBC 

Conditional subtract 

1 

(ACC! |(dma) x 2 16 | ACC 

SUBH 

Subtract from high accumulator 

1 

SUBK* 

Subtract from accumulator short immediate 

1 

(ACC) 8 bit constant -* ACC 

SUBS 

Subtract from low accumulator with sign 

extension suppressed 

1 

(ACC) (dma) ACC 

SUBT f 

Subtract from accumulator with shift specified by 

T register 

1 

(ACC) |(dma) x 2' Tr eg)| . ACC 

XOR 

Exclusive-OR with accumulator 

1 

(ACCI1 5-0)).XOR.(dma) * ACCI15 0) 

XORK t 

Exclusive OR immediate with accumulator with shift 

2 

(ACC(30-0!).XOR.[16-bit constant x 2 shlft | -» 
ACCOO-O) 

ZAC 

Zero accumulator 

1 

0 - ACC 

ZALH 

Zero low accumulator and load high accumulator 

1 

(dma) x 2 16 ACC 

ZALR* 

Zero low accumulator and load high accumulator 

with rounding 

1 

(dma) x 2^® + >8000 + ACC 

ZALS 

Zero accumulator and load low accumulator with 
sign extension suppressed 

1 

(dma) > ACCL, 0 * ACCH 


^These instructions are not included in the TMS32010 instruction set. 
*These instructions are not included in the TMS32020 instruction set. 
AUXILIARY REGISTERS AND DATA PAGE POINTER INSTRUCTIONS

MNEMONIC  DESCRIPTION (NO. WORDS): OPERATION

ADRK‡     Add to auxiliary register short immediate (1): (ARn) + 8-bit constant → ARn
CMPR†     Compare auxiliary register with auxiliary register AR0 (1): If ARn CM AR0, then 1 → TC; else 0 → TC
LAR       Load auxiliary register (1): (dma) → ARn
LARK      Load auxiliary register short immediate (1): 8-bit constant → ARn
LARP      Load auxiliary register pointer (1): 3-bit constant → ARP, (ARP) → ARB
LDP       Load data memory page pointer (1): (dma) → DP
LDPK      Load data memory page pointer immediate (1): 9-bit constant → DP
LRLK†     Load auxiliary register long immediate (2): 16-bit constant → ARn
MAR       Modify auxiliary register (1)
SAR       Store auxiliary register (1): (ARn) → dma
SBRK‡     Subtract from auxiliary register short immediate (1): (ARn) − 8-bit constant → ARn

T REGISTER, P REGISTER, AND MULTIPLY INSTRUCTIONS

MNEMONIC  DESCRIPTION (NO. WORDS): OPERATION

APAC      Add P register to accumulator (1): (ACC) + (shifted Preg) → ACC
LPH†      Load high P register (1): (dma) → Preg(31-16)
LT        Load T register (1): (dma) → Treg
LTA       Load T register and accumulate previous product (1): (dma) → Treg, (ACC) + (shifted Preg) → ACC
LTD       Load T register, accumulate previous product, and move data (1): (dma) → Treg, (dma) → dma + 1, (ACC) + (shifted Preg) → ACC
LTP†      Load T register and store P register in accumulator (1): (dma) → Treg, (shifted Preg) → ACC
LTS†      Load T register and subtract previous product (1): (dma) → Treg, (ACC) − (shifted Preg) → ACC
MAC†      Multiply and accumulate (2): (ACC) + (shifted Preg) → ACC, (pma) × (dma) → Preg
MACD†     Multiply and accumulate with data move (2): (ACC) + (shifted Preg) → ACC, (pma) × (dma) → Preg, (dma) → dma + 1
MPY       Multiply (with T register, store product in P register) (1): (Treg) × (dma) → Preg
MPYA‡     Multiply and accumulate previous product (1): (ACC) + (shifted Preg) → ACC, (Treg) × (dma) → Preg
MPYK      Multiply immediate (1): (Treg) × 13-bit constant → Preg
MPYS‡     Multiply and subtract previous product (1): (ACC) − (shifted Preg) → ACC, (Treg) × (dma) → Preg
MPYU‡     Multiply unsigned (1): Usgn(Treg) × Usgn(dma) → Preg
PAC       Load accumulator with P register (1): (shifted Preg) → ACC
SPAC      Subtract P register from accumulator (1): (ACC) − (shifted Preg) → ACC
SPH‡      Store high P register (1): (shifted Preg(31-16)) → dma
SPL‡      Store low P register (1): (shifted Preg(15-0)) → dma
SPM†      Set P register output shift mode (1): 2-bit constant → PM
SQRA†     Square and accumulate (1): (ACC) + (shifted Preg) → ACC, (dma) × (dma) → Preg
SQRS†     Square and subtract previous product (1): (ACC) − (shifted Preg) → ACC, (dma) × (dma) → Preg

SYMBOL    MEANING

ACC       Accumulator
ARB       Auxiliary register pointer buffer
ARn       Auxiliary register n (AR0 through AR7 are predefined assembler symbols equal to 0 through 7, respectively)
ARP       Auxiliary register pointer
BIO       Branch control input
C         Carry bit
CM        2-bit field specifying compare mode
CNF       On-chip RAM configuration control bit
dma       Data memory address
DP        Data page pointer
FO        Format status bit
FSM       Frame synchronization mode bit
HM        Hold mode bit
INTM      Interrupt mode flag bit
>nn       Indicates nn is a hexadecimal number (all others are assumed to be decimal values)
OV        Overflow flag bit
OVM       Overflow mode bit
P         Product register
PA        Port address (PA0 through PA15 are predefined assembler symbols equal to 0 through 15, respectively)
PC        Program counter
PM        2-bit field specifying P register output shift code
pma       Program memory address
Preg      Product register
RPTC      Repeat counter
STn       Status register n (ST0 or ST1)
SXM       Sign-extension mode bit
T         Temporary register
TC        Test control bit
TOS       Top of stack
Treg      Temporary register
TXM       Transmit mode bit
Usgn      Unsigned value
XF        XF pin status bit
→         Is assigned to
| |       An absolute value
[ ]       Optional items
( )       Contents of
BRANCH/CALL INSTRUCTIONS

MNEMONIC  DESCRIPTION (NO. WORDS): OPERATION

B         Branch unconditionally (2): pma → PC
BACC†     Branch to address specified by accumulator (1): (ACC(15-0)) → PC
BANZ      Branch on auxiliary register not zero (2): If (AR(ARP)) ≠ 0, then pma → PC; else (PC) + 2 → PC
BBNZ†     Branch if TC bit ≠ 0 (2): If (TC) = 1, then pma → PC; else (PC) + 2 → PC
BBZ†      Branch if TC bit = 0 (2): If (TC) = 0, then pma → PC; else (PC) + 2 → PC
BC‡       Branch on carry (2): If (C) = 1, then pma → PC; else (PC) + 2 → PC
BGEZ      Branch if accumulator ≥ 0 (2): If (ACC) ≥ 0, then pma → PC; else (PC) + 2 → PC
BGZ       Branch if accumulator > 0 (2): If (ACC) > 0, then pma → PC; else (PC) + 2 → PC
BIOZ      Branch on I/O status = 0 (2): If (BIO) = 0, then pma → PC; else (PC) + 2 → PC
BLEZ      Branch if accumulator ≤ 0 (2): If (ACC) ≤ 0, then pma → PC; else (PC) + 2 → PC
BLZ       Branch if accumulator < 0 (2): If (ACC) < 0, then pma → PC; else (PC) + 2 → PC
BNC‡      Branch on no carry (2): If (C) = 0, then pma → PC; else (PC) + 2 → PC
BNV†      Branch if no overflow (2): If (OV) = 0, then pma → PC; else (PC) + 2 → PC
BNZ       Branch if accumulator ≠ 0 (2): If (ACC) ≠ 0, then pma → PC; else (PC) + 2 → PC
BV        Branch on overflow (2): If (OV) ≠ 0, then pma → PC; else (PC) + 2 → PC
BZ        Branch if accumulator = 0 (2): If (ACC) = 0, then pma → PC; else (PC) + 2 → PC
CALA      Call subroutine indirect (1): (ACC(15-0)) → PC, (PC) + 1 → TOS
CALL      Call subroutine (2): (PC) + 2 → TOS, pma → PC
RET       Return from subroutine (1): (TOS) → PC

I/O AND DATA MEMORY OPERATIONS

MNEMONIC  DESCRIPTION (NO. WORDS): OPERATION

BLKD†     Block move from data memory to data memory (2): (dma1, addressed by PC) → dma2
BLKP†     Block move from program memory to data memory (2): (pma, addressed by PC) → dma
DMOV      Data move in data memory (1): (dma) → dma + 1
FORT†     Format serial port registers (1): 1-bit constant → FO
IN        Input data from port (1): (data bus, addressed by PA) → dma
OUT       Output data to port (1): (dma) → data bus, addressed by PA
RFSM‡     Reset serial port frame synchronization mode (1): 0 → FSM
RTXM†     Reset serial port transmit mode (1): 0 → TXM
RXF†      Reset external flag (1): 0 → XF
SFSM‡     Set serial port frame synchronization mode (1): 1 → FSM
STXM†     Set serial port transmit mode (1): 1 → TXM
SXF†      Set external flag (1): 1 → XF
TBLR      Table read (1): (pma, addressed by ACC(15-0)) → dma
TBLW      Table write (1): (dma) → pma, addressed by ACC(15-0)

CONTROL INSTRUCTIONS

MNEMONIC  DESCRIPTION (NO. WORDS): OPERATION

BIT†      Test bit (1): (dma bit at (15 − bit code)) → TC
BITT†     Test bit specified by T register (1): (dma bit at (15 − Treg)) → TC
CNFD†     Configure block as data memory (1): 0 → CNF
CNFP†     Configure block as program memory (1): 1 → CNF
DINT      Disable interrupt (1): 1 → INTM
EINT      Enable interrupt (1): 0 → INTM
IDLE†     Idle until interrupt (1): (PC) + 1 → PC, powerdown
LST       Load status register ST0 (1): (dma) → ST0
LST1†     Load status register ST1 (1): (dma) → ST1
NOP       No operation (1): (PC) + 1 → PC
POP       Pop top of stack to low accumulator (1): (TOS) → ACC
POPD†     Pop top of stack to data memory (1): (TOS) → dma
PSHD†     Push data memory value onto stack (1): (dma) → TOS
PUSH      Push low accumulator onto stack (1): (ACCL) → TOS
RC‡       Reset carry bit (1): 0 → C
RHM‡      Reset hold mode (1): 0 → HM
ROVM      Reset overflow mode (1): 0 → OVM
RPT†      Repeat instruction as specified by data memory value (1): (dma) → RPTC
RPTK†     Repeat instruction as specified by immediate value (1): 8-bit constant → RPTC
RSXM†     Reset sign-extension mode (1): 0 → SXM
RTC‡      Reset test/control flag (1): 0 → TC
SC‡       Set carry bit (1): 1 → C
SHM‡      Set hold mode (1): 1 → HM
SOVM      Set overflow mode (1): 1 → OVM
SST       Store status register ST0 (1): (ST0) → dma
SST1†     Store status register ST1 (1): (ST1) → dma
SSXM†     Set sign-extension mode (1): 1 → SXM
STC‡      Set test/control flag (1): 1 → TC
TRAP†     Software interrupt (1): (PC) + 1 → TOS, 30 → PC


The TMS320C25's conditions and modes are stored in
two status registers, ST0 and ST1. Instructions are provided
to allow these registers to be stored in or loaded from data
memory. This capability allows the current status of the
device to be saved during interrupts and subroutine calls.

TMS320C25 software 

Earlier, we characterized digital signal processing as the
real-time processing of mathematically intensive algorithms.
This characterization equates to a requirement for high-speed
multiply/accumulate capability in a processor. The
performance of a signal processor is therefore measured in
terms appropriate to this requirement: the speed of execution
of individual instructions, the power of the instruction set,
and the I/O capabilities. The speed is given as the basic
instruction cycle time and the number of cycles required to
complete any instruction.

As we noted earlier, pipelining of instruction fetching,
decoding, and execution provides an instruction cycle time
of only 100 ns. The overwhelming majority of the
TMS320C25's instructions (97 out of 133) are executed in
a single instruction cycle. Of the 36 instructions requiring
additional cycles for execution, 21 involve branches, calls,
and returns that result in a reload of the program counter
and a break in the execution pipeline. Another seven of
the instructions are two-word, long immediate instructions.
The remaining eight (IN, OUT, BLKD, BLKP, TBLR, TBLW,
MAC, and MACD) support I/O and transfers of data between
memory spaces, or provide for additional parallel operation
in the processor. Furthermore, these eight instructions
become single-cycle when used in conjunction with the repeat
counter. The instruction set of the TMS320C25 exploits the
parallelism of the processor, allowing complex or numerically
intensive computations to be implemented in relatively few
instructions.

Table 1 lists the TMS320C25’s instructions. 

Addressing modes. Most TMS320C25 instructions are
coded in a single 16-bit word, which is why most can be
executed in a single cycle. The 16-bit word comprises an
eight-bit opcode and an eight-bit address. Three memory
addressing modes are available: direct, indirect, and immediate
(Table 2). Both direct and indirect addressing are used to
access data memory. Immediate addressing uses the contents
of the memory addressed by the program counter. Figure 4
illustrates operand addressing in the direct, indirect, and
immediate modes.

Table 2.
Addressing modes.

ADDRESSING MODE      OPERATION
OP A                 Direct addressing
OP *(,NARP)          Indirect; no change to AR
OP *+(,NARP)         Indirect; current AR is incremented.
OP *-(,NARP)         Indirect; current AR is decremented.
OP *0+(,NARP)        Indirect; AR0 is added to current AR.
OP *0-(,NARP)        Indirect; AR0 is subtracted from current AR.
OP *BR0+(,NARP)      Indirect; AR0 is added to current AR (with
                     reverse carry propagation).
OP *BR0-(,NARP)      Indirect; AR0 is subtracted from current AR
                     (with reverse carry propagation).

NOTE: The optional NARP field specifies a new value of the ARP.

In direct addressing, seven bits of the instruction word
are concatenated with the nine-bit data memory page
pointer (DP) to form the 16-bit data memory address. The
DP register points to one of 512 possible data memory
pages, each 128 words in length, to obtain a 64K total data
memory space. The seven-bit address in the instruction
points to the specific location within the data memory page.
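As a rough illustration of the direct-addressing arithmetic described above, the following Python sketch concatenates a nine-bit page pointer with a seven-bit offset. The function and argument names are invented for this example and do not come from any TI tool.

```python
def direct_address(dp, offset):
    """Form a 16-bit data memory address from a 9-bit page pointer (DP)
    and the 7-bit address field carried in the instruction word."""
    if not 0 <= dp < 512:
        raise ValueError("DP must fit in nine bits")
    if not 0 <= offset < 128:
        raise ValueError("offset must fit in seven bits")
    # DP supplies the upper nine bits, the instruction the lower seven.
    return (dp << 7) | offset


# 512 pages of 128 words each cover the full 64K-word data space.
assert 512 * 128 == 65536

# Page 7, offset >7F is the last word of that page.
print(hex(direct_address(7, 0x7F)))
```

With DP = 0, the offsets 0 through >7F select the first 128 words of data memory, which is where a routine would keep variables it wants to reach without reloading DP.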

Indirect addressing is provided by the eight auxiliary
registers AR0-AR7. These registers can be used to indirectly
address data memory, as loop counters, or for temporary
data storage. Indirect auxiliary register addressing (Figure 5)
allows placement of the data memory address of an instruction
operand into one of the eight auxiliary registers. These
registers are pointed to by a three-bit auxiliary register
pointer (ARP) that is loaded with a value from 0 through 7
designating AR0 through AR7, respectively. The auxiliary
registers and the ARP may be loaded either from data
memory or by an immediate operand defined in the instruction.
Furthermore, the contents of the auxiliary registers
may be stored in data memory.

There are seven types of indirect addressing (see Table 2
again):

• indexing with increment,

• indexing with decrement,

• indexing by adding the contents of AR0,

• indexing by subtracting the contents of AR0,

• indexing by adding the contents of AR0 with the carry
propagation reversed (for bit-reversing an FFT),

• indexing by subtracting the contents of AR0 with the
carry propagation reversed (also for bit-reversing an FFT),
and

• no indexing.

All indexing operations are performed on the current
auxiliary register in the same cycle as the original instruction,
with loading of a new ARP value available as an option.
The operations performed in the ARAU can even be performed
during branch instruction execution, allowing efficient
control with conditional looping.

Bit-reversed indexed addressing modes allow efficient I/O
to be performed for the resequencing of data points in a
radix-2 FFT program. The direction of carry propagation in
the ARAU is reversed when this mode is selected, and AR0
is added to or subtracted from the current auxiliary register.
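The effect of reversed carry propagation can be sketched in Python. This is a simplified model, not a description of the ARAU hardware: it assumes the index fits in a fixed number of bits (`width`, a parameter invented here) and obtains the reverse-carry sum by bit-reversing both operands, adding normally, and bit-reversing the result.

```python
def bit_reverse(value, width):
    """Reverse the order of the low `width` bits of value."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(a, b, width):
    """Add a and b with the carry propagating from the most significant
    bit toward the least significant bit (modelled over `width` bits)."""
    mask = (1 << width) - 1
    total = (bit_reverse(a, width) + bit_reverse(b, width)) & mask
    return bit_reverse(total, width)


def bit_reversed_sequence(n_points):
    """Index sequence produced by repeatedly adding N/2 with reverse carry."""
    width = n_points.bit_length() - 1    # log2(N) for a power of two
    step = n_points // 2                 # plays the role of the AR0 contents
    index, sequence = 0, []
    for _ in range(n_points):
        sequence.append(index)
        index = reverse_carry_add(index, step, width)
    return sequence


print(bit_reversed_sequence(8))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

Loading the index register with half the FFT length is what makes each reverse-carry addition step to the next point in bit-reversed order.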

In immediate addressing, the instruction word contains
the value of the immediate operand. Both single-word (8-bit
and 13-bit constant) short immediate instructions and
two-word (16-bit constant) long immediate instructions are
included in the instruction set. In the case of long immediate
instructions, the word following the instruction opcode is
used as the immediate operand. MPYK is an example of an
immediate instruction; it multiplies the contents of the T
register by a signed 13-bit constant. Seventeen immediate
operand instructions are included in the instruction set (see
Table 1 again).

Instruction set parallelism: an example. The MACD
(multiply/accumulate and data move) instruction serves as
an informative example of the parallelism designed into the
TMS320C25 instruction set as well as into the TMS320C25
architecture. As shown in Equation 1, the requirement for
parallelism exists in common DSP operations such as
convolution and filtering.6,7

Parallelism in the execution of instructions enables a
complete multiply/accumulate/data move operation to be
completed in a single 100-ns instruction cycle. The execution
of the MACD involves the following steps:

1) The contents of the 32-bit P register are shifted (scaled) 
by an output shifter. 

2) The 32-bit ALU accumulates the shifted result of the 
32-bit P register with the current contents of the 32-bit 
accumulator. 

3) The 16-bit contents of a data memory location (usually 
addressed indirectly via one of the auxiliary registers) are 
loaded into the T register. 

4) The 16-bit contents of a program memory location 
(addressed via the prefetch counter PFC) are introduced to 
the multiplier and a 16 X 16-bit multiply is executed, 
resulting in a new 32-bit product. The product is placed in 
the P register to be accumulated during the next cycle. 

5) The 16-bit contents of the data memory location are 
copied to the next higher data memory address. 

6) The carry and overflow status bits are set, as appropriate,
in the status registers.

7) The 16-bit contents of the auxiliary register pointed to 
by the ARP are modified (typically decremented) in 
preparation for the use of the data memory address on the 
next cycle. 

8) The 16-bit contents of the PFC are incremented in 
preparation for the use of the program memory address on 
the next cycle. 

9) The repeat counter is decremented. 

As can be seen from the above, one of the data values is
taken from data memory while the other is taken from program
memory. A single-cycle execution and data move is accomplished
when the data memory being addressed is the
on-chip data memory. The program memory location can
be either on or off chip and, if on chip, can come from
either ROM or the reconfigurable memory block B0.
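The MACD steps listed above can be modelled in a few lines of Python. This is a behavioural sketch only: it ignores the output shifter, the status bits, and word-length limits, and the function and variable names are invented for the illustration. Coefficients are assumed to be stored so that the first one read pairs with the oldest sample.

```python
def macd_fir(samples, coeffs):
    """Model a repeated MACD followed by a final accumulate.

    samples[0] is the newest input and samples[-1] the oldest.
    coeffs[0] multiplies the oldest sample (an assumed storage order).
    Returns the accumulated sum and the shifted delay line.
    """
    n = len(coeffs)
    data = list(samples) + [0]     # spare word receives the oldest sample
    acc = 0                        # accumulator
    p = 0                          # P register (previous product)
    ar = n - 1                     # auxiliary register: oldest sample
    pfc = 0                        # prefetch counter into the coefficients
    for _ in range(n):             # the repeat counter
        acc += p                   # accumulate the previous product
        t = data[ar]               # load T from data memory
        p = t * coeffs[pfc]        # multiply by the program memory word
        data[ar + 1] = data[ar]    # data move to the next higher address
        ar -= 1                    # auxiliary register decremented
        pfc += 1                   # prefetch counter incremented
    acc += p                       # final accumulate of the last product
    return acc, data


total, delay_line = macd_fir([1, 2, 3], [10, 20, 30])
print(total, delay_line)           # 100 [1, 1, 2, 3]
```

After the loop every sample has moved up one address, so the next input can be written at the start of the delay line without any separate shifting pass.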

Parallel operation of certain subsets of TMS320C25 functions
is also available. These subsets include loading the T
register in combination with addition (LTA), subtraction
(LTS), or a move of the P register's contents to the accumulator
(LTP). The accumulation can be supplemented
by the data move function (LTD). Another combination
(MPYA/MPYS) provides the accumulation of the previous
product along with the execution of the multiplier to
generate a new product. This combination is particularly
useful in adaptive filtering techniques such as those embodied
in the least-mean-square (LMS) algorithm.4,15 The
implementation of an adaptive filter by means of these
instructions will be described in detail in the section on
applications.

Block moves. The TMS320C25 provides six instructions
for data and program block moves and transfers of data via
the I/O ports. When these instructions are pipelined by
means of the repeat instruction, significantly higher throughput
is achieved: one 16-bit word is transferred in each 100-ns
cycle, for a transfer rate of 160 million bits per second.

The BLKD instruction moves a block within data
memory, and the BLKP instruction moves a block from
program memory to data memory. Block transfers between
program and data memory spaces can also be implemented
with the TBLR and TBLW (table read and table write)
instructions. The advantages of TBLR and TBLW are that
they allow the source address as well as the destination
address to be determined during programming and that they
permit the data to be transferred from data memory to program
memory. The IN and OUT instructions permit data to
be transferred between the I/O and data memory spaces.
While the source address is determined by the prefetch
counter, which is incremented on every cycle, the destination
address is determined by an auxiliary register whose
contents can be modified in any of the previously specified
ways. This permits sequential and contiguous data placement
(*+, *-), sequential but noncontiguous data placement
(*0+, *0-), or scrambled data placement
(*BR0+, *BR0-). The value of these address modifications
during block data transfers becomes particularly apparent
in the use of indexing with reverse-carry propagation
to set up the data block in an FFT. The result is not only a
savings in execution time but a savings in program memory
space as well.

Floating-point support. The TMS320C25 supports
floating-point operations for applications requiring a large
dynamic range. The NORM (normalization) instruction
normalizes fixed-point numbers contained in the accumulator
by performing left shifts. The LACT (load accumulator
with shift specified by the T register) instruction
denormalizes a floating-point number by arithmetically
left-shifting the mantissa through the input scaling shifter. The
shift count, in this case, is the value of the exponent specified
by the four low-order bits of the T register. ADDT and
SUBT instructions (add to/subtract from accumulator with
shift specified by the T register) have been provided to allow
additional arithmetic operations.
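The idea behind normalization and denormalization can be sketched in Python. This is a generic model under stated assumptions, not a cycle-accurate description of NORM or LACT: `normalize` shifts a signed 32-bit value left until its two most significant bits differ and reports the shift count, and `load_with_t_shift` shifts a mantissa left by the four low-order bits of a T value. Both function names are invented here.

```python
def normalize(acc):
    """Left-shift a signed 32-bit value until bits 31 and 30 differ.

    Returns (normalized value, number of left shifts). Zero cannot be
    normalized and is returned unchanged with a count of zero.
    """
    if acc == 0:
        return 0, 0
    shifts = 0
    # While the value still fits in 31 bits, the top two bits are equal.
    while -(1 << 30) <= acc < (1 << 30):
        acc <<= 1
        shifts += 1
    return acc, shifts


def load_with_t_shift(mantissa, t_register):
    """Shift the mantissa left by the four low-order bits of T."""
    return mantissa << (t_register & 0xF)


value, exponent = normalize(1)
print(hex(value), exponent)           # 0x40000000 30
print(load_with_t_shift(3, 0x12))     # 12
```

The shift count returned by the first function is the quantity a program would keep as the exponent of its floating-point representation.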

TMS320C25 hardware 

The most important task for a hardware designer is interfacing
the DSP device to the rest of the system as inexpensively
as possible. Here, we will discuss the TMS320C25's
interfacing capabilities.


Figure 5. Example of indirect auxiliary register addressing. 
Figure 6. Minimal configuration for external program
memory.


System configurations. The flexibility of the TMS320C25
allows system configurations that satisfy a broad range of
application requirements. The TMS320C25 can be configured as

• a stand-alone system (that is, as a single processor using 
4K words of on-chip ROM and 544 words of on-chip 
RAM), 

• part of a parallel multiprocessing system (two or more 
TMS320C25s) with shared global data memory, or 

• a coprocessor for a host processor. 

The stand-alone system interface consists of a 16-bit parallel
data bus, a 16-bit address bus, three pins for memory
space select, and various system control signals. In Figure 6,
an external data RAM and a PROM/EPROM have been
added to the basic stand-alone system. The READY signal
is used for wait-state generation for communicating with
slower off-chip memories. All the memories and I/O
devices are directly controlled by the TMS320C25, thus
minimizing external hardware requirements.

Parallel multiprocessing and host/coprocessor systems 
take advantage of the TMS320C25’s direct memory access 
and global memory configuration capabilities. 


Direct memory access. The TMS320C25 supports direct 
memory access to its external program/data memory and 
I/O space through its HOLD and HOLDA signals. Direct 
memory access can be used for multiprocessing: Execution 
on one or more processors can be temporarily halted to 
allow another processor to read from or write to the halted 
processor’s local off-chip memory. Here the multiprocessing 
is typically performed in a master/slave configuration. The 
master can initialize the slave by downloading a program 
into its program memory space or provide the slave with the 
data needed to complete a task. 

In a direct memory access scheme, the master may be a
general-purpose CPU, a TMS320C25, or perhaps even an
A/D converter. A master TMS320C25 takes complete control
of the slave's external memory by asserting HOLD low
through its external flag (XF). This causes the slave to place
its address, data, and control lines in a high-impedance
state. By asserting RS in conjunction with HOLD, the
master processor can load the slave's local program memory
with the necessary initialization code on reset or power-up.
The two processors can be synchronized through use of the
SYNC pin to make the transfer over the memory bus faster
and more efficient.

After control of the slave's buses is given to the master
processor, the slave alerts the master by asserting HOLDA.
This signal can be tied to the master's BIO pin. The slave's
XF pin can be used to indicate to the master when the slave
has finished performing its task and needs to be reprogrammed
or given additional data to continue processing.

In a multiple-slave configuration, the priority of each slave's
task can be determined by tying each slave's XF signal to the
appropriate INT pin on the master.

A PC environment provides an example of a direct
memory access scheme in which the system bus is used for
data transfer. In this configuration, either the master CPU
or a disk controller may place data on the system bus for
downloading into the local memory of the TMS320C25.
Here the TMS320C25 acts like a peripheral processor with
multifunction capability. In a speech application, for example,
the master can load the TMS320C25's program
memory with algorithms to perform tasks such as speech
analysis, synthesis, or recognition, and its data memory
with the required speech templates. In a graphics application,
the TMS320C25 can serve as a dedicated graphics
engine; programs can be stored in ROM or downloaded via
the system bus into program RAM. Again, data can come
from PC disk storage or be provided directly by the master
CPU. In this configuration, decode and arbitration logic is
used to control the direct memory access. When the address
on the system bus resides in the local memory of the peripheral
TMS320C25, this logic asserts the HOLD signal while
sending the master a not-ready indication to allow wait
states. After the TMS320C25 acknowledges the direct
memory access by asserting HOLDA, READY is asserted
and the information is transferred.

Global memory. In some digital signal processing tasks,
the algorithm being implemented can be divided into sections
and a processor dedicated to each. In this case, the
first and second processors can share global data memory,
as can the second and third, the third and fourth, and so
on. Arbitration logic may be required to determine which
section of the algorithm will execute and which processor
will have access to the global memory. The dedication of
each processor to a distinct section of the algorithm makes
pipelined execution, and thus higher throughput, possible.

External memory can be divided into global and local sections.
Special registers and pins on the TMS320C25 allow
multiple processors to share up to 32K words of global data
memory. This facilitates efficient "shared data" multiprocessing,
in which data are transferred between two or
more processors. Unlike a direct memory access scheme,
reading or writing global memory does not require one of
the processors to be halted.

TMS320C25 development tools 
and support 

A digital signal processor is essentially an application-specific
microprocessor or microcomputer. Like any microprocessor,
it needs good development tools and technical
support: no matter how impressive its performance or how
easy its interfacing to other devices, it cannot be easily
designed into systems without such tools and support. In
developing an application, a designer encounters problems
and needs to ask questions. Often the tools and vendor support
given him are the difference between the success and
failure of his project.

The TMS320C25 is supported by many development
tools.16 These tools range from inexpensive modules for
application evaluation and benchmarking to an assembler/linker
and software simulator to a full-capability hardware
emulator.

Software tools. An assembler/linker and software simulator
that enable users to develop and debug TMS320 DSP
algorithms are available for the TI PC, IBM PC, and VAX.

Assembler/linker. The macro assembler translates assembly
language source code into executable object code. It
allows the programmer to work with mnemonics rather than
hexadecimal machine instructions and to reference memory
locations with symbolic addresses. It supports macro calls
and definitions along with conditional assembly. The linker
permits a program to be designed and implemented in
separate modules that are later linked to form the complete
program. The linker resolves external definitions and
references for relocatable code, creating an object file that
can be executed by the simulator, emulator, or the TMS320C25
processor. The macro assembler/linker is currently
available for the VAX/VMS, TI PC/MS-DOS, and IBM
PC/PC-DOS operating systems.

Simulator. The simulator is a software program that
simulates TMS320 operations to allow program verification.
Its debug mode enables the user to monitor the state of the
simulated TMS320 while his program is executing. The
simulator uses the object code produced by the macro
assembler/linker. During program execution, the internal
registers and memory of the simulated TMS320 are modified
as each instruction is interpreted by the host computer.
Once program execution is suspended, the internal registers
and the program and data memories can be inspected and
modified. The simulator is currently available for the
VAX/VMS, TI PC/MS-DOS, and IBM PC/PC-DOS operating
systems.

Hardware tools. Tools are provided for in-circuit emulation
and hardware program debugging such as breakpointing
and tracing so that DSP algorithms can be developed
and tested in a real-product environment.

Evaluation module. The evaluation module, or EVM, is a
stand-alone board that contains all the hardware tools
needed to evaluate the TMS320C25 and that provides in-circuit
emulation of it. The EVM's firmware package contains
a debug monitor, an editor, an assembler, a reverse
assembler, and software communication to two EIA ports.
These ports allow the EVM to be connected to a terminal
and to either a host computer or a line printer. The EVM
accepts either source or object code downloaded from the
host computer. Its resident assembler converts incoming
source text into executable code in just one pass by automatically
resolving labels after the first assembly pass is
completed. When a session is finished, code is saved via the
host computer interface.

Software development system. The SWDS is a plug-in
card for the TI PC and IBM PC that provides the same
functionality as the EVM.

Emulator. The XDS (Extended Development System) is
an emulator providing full-speed in-circuit emulation with
real-time hardware breakpointing and tracing and program
execution capability from target memory. The XDS allows
integration of hardware and software modules in the debug
mode. By setting breakpoints based on internal conditions
or external events, the XDS user can suspend execution of
the program and give control to the debug mode. In the
debug mode, he can inspect and modify all registers and
memory locations. Single-step execution is available.
Full-trace capabilities at full speed and a reverse assembler that
translates machine code back into assembly instructions also
increase debugging productivity. The XDS system is designed
to interface with either a terminal or a host computer.
Object code generated by the assembler/linker can be
downloaded to the XDS and then controlled through a
terminal.

Analog interface board. The AIB is an analog-to-digital
(A/D) and digital-to-analog (D/A) conversion board that
can be used in conjunction with the EVM or XDS. It can
also be used in an educational environment to help familiarize
the user with real-world digital signal processing techniques.
The AIB includes A/D and D/A converters with
12-bit resolution as well as antialiasing and smoothing filters
that have a cut-off frequency programmable from 4.7 kHz
to 20 kHz.

In addition to the above design tools, development support
includes

• the Digital Filter Design Package, which runs on both
TI and IBM PCs and which allows the user to design digital
filters (low-pass, high-pass, band-pass, and band-stop types)
using a menu-driven approach,

• TI Regional Technology Centers staffed with qualified
engineers who provide technical support and design services,

• access to third parties with DSP expertise in various
application areas,

• a series of DSP books covering DSP theory, algorithms,
and applications and TMS320 implementations,4,5,7

• documentation such as user's guides,10-12 data sheets, a
development support reference guide,16 and comprehensive
application reports,4 and

• a technical support hotline and a bulletin board service.


TMS320C25 applications 

The TMS320C25 is designed for real-time DSP and
other computation-intensive tasks in telecommunications,
graphics, image processing, high-speed control, speech processing,
instrumentation, and numeric processing. In these
applications, the TMS320C25 provides an excellent means
for executing signal processing algorithms such as fast Fourier
transforms (FFTs), digital filters, frequency synthesizers,
correlators, and convolution routines. It can also execute
general-purpose functions since it includes bit-manipulation
instructions, block data move capabilities, large program
and data memory address spaces, and flexible memory
mapping.

Since digital filters are used in so many DSP applications,
let us examine them as a prelude to our discussion of
TMS320C25 applications.


Digital filtering. Filters are often implemented in digital
signal processing systems. Such filters fall into two
categories: finite impulse response (FIR) filters and infinite
impulse response (IIR) filters.4,6 For both types of filter, the
coefficients of the filter (weighting factors) may be fixed or
adapted during the course of the signal processing. The
TMS320C25 reduces the execution time of all filters by virtue
of its 100-ns instruction cycle time and optimized instructions
for filter operations.

As we stated earlier, the FIR filter is simply the sum of
products in a sampled data system (see Equation 1 again).
A simple implementation of the FIR filter uses the MACD
instruction (multiply/accumulate and data move) for each
filter tap and the RPT/RPTK instruction to repeat the
MACD for each tap. Thus, a 256-tap FIR filter can be
implemented as

        RPTK  255
        MACD  COEFFP,*-

Here, the coefficients can be stored anywhere in program
memory (in the reconfigurable on-chip RAM, in the on-chip
ROM, or in external memories). When the coefficients are
stored in on-chip ROM or externally, the entire on-chip data
RAM can be used to store the sample sequence. This allows
filters of up to 512 taps to be implemented. Execution of
the filter will be at full speed, or 100 ns per tap, as long as
the memory (either on-chip RAM or high-speed external
RAM) supports full-speed execution.

Up to this point, we have assumed that the filter coefficients
are fixed from sample to sample. If the coefficients
are adapted or updated with time, as they are in adaptive
filters for echo cancellation,4,15 the DSP algorithm requires
a greater computational capacity from the processor. To
adapt or update the coefficients, usually with each sample,
the TMS320C25 uses three instructions: multiply and
add/subtract previous product to/from accumulator
(MPYA/MPYS), zero-out low-order accumulator bits and
load high-order accumulator bits with data (ZALR), and
store high-order bits of accumulator to data memory
(SACH). The method it uses to adapt the coefficients is the
least-mean-square, or LMS, algorithm, which can be expressed as

$$b_k(i+1) = b_k(i) + 2B\,[e(i) \cdot x(i-k)], \qquad (2)$$

where $b_k(i+1)$ is the weighting coefficient for the next sample
period, $b_k(i)$ is the weighting coefficient for the present
sample period, $B$ is the gain factor or adaptation step size,
$e(i)$ is the error function, and $x(i-k)$ is the input of the
filter.
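A floating-point Python sketch of the update in Equation 2 may help make the roles of the terms concrete. It is an illustration only: the system-identification setup, the function name, and the choice of test signal are assumptions made for this example, and the error is taken as the reference minus the filter output.

```python
import random


def lms_identify(target, step, iterations, seed=0):
    """Adapt FIR coefficients b toward `target` using Equation 2.

    step stands for the product 2B. The reference signal is produced by
    filtering the same input with the (normally unknown) target taps.
    """
    rng = random.Random(seed)
    n = len(target)
    b = [0.0] * n                  # b_k(i), all taps start at zero
    x = [0.0] * n                  # x[k] holds x(i-k)
    for _ in range(iterations):
        x = [rng.uniform(-1.0, 1.0)] + x[:-1]            # shift in new sample
        d = sum(t * xk for t, xk in zip(target, x))      # reference signal
        y = sum(bk * xk for bk, xk in zip(b, x))         # filter output
        e = d - y                                        # error e(i)
        factor = step * e          # 2B*e(i): one value shared by every tap
        b = [bk + factor * xk for bk, xk in zip(b, x)]   # Equation 2
    return b


taps = lms_identify([0.5, -0.25, 0.125, 0.0625], step=0.1, iterations=2000)
print([round(t, 4) for t in taps])
```

Because the product of step size and error is the same for every tap within one sample period, it is computed once outside the per-tap update, which leaves one multiply and one add per coefficient.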

In an adaptive filter, the coefficients $b_k(i)$ must be updated
to minimize the error function $e(i)$, which is the difference
between the output of the filter and a reference
signal. Quantization errors arising during coefficient updating
can strongly affect the performance of the filter, but
these errors can be minimized if the updated values are obtained
by rounding rather than truncating. For each coefficient
in the filter at a given point in time, the factor
2*B*e(i) is a constant. This factor can be computed once
and stored in the T register for each of the updates. This
reduces the computational requirement to one multiply/accumulate
plus rounding. Without the new instructions,
the adaptation of each coefficient would take five instructions
corresponding to five clock cycles, as the following
instruction sequence shows:


        LRLK  AR2,COEFFD    ; LOAD ADDRESS OF COEFFICIENTS
        LRLK  AR3,LASTAP    ; LOAD ADDRESS OF DATA SAMPLES
        LARP  AR2
        LT    ERRF          ; errf = 2*B*e(i)
        ZALH  *,AR3         ; ACC = bk(i)*2**16
        ADD   ONE,15        ; ACC = bk(i)*2**16 + 2**15
        MPY   *-,AR2
        APAC                ; ACC = bk(i)*2**16
                            ;       + errf*x(i-k) + 2**15
        SACH  *+            ; SAVE bk(i+1).


When the MPYA and ZALR instructions are used, the
adaptation reduces to three instructions corresponding to
three clock cycles, as shown below:

        LRLK  AR2,COEFFD    ; LOAD ADDRESS OF COEFFICIENTS
        LRLK  AR3,LASTAP    ; LOAD ADDRESS OF DATA SAMPLES
        LARP  AR2
        LT    ERRF          ; errf = 2*B*e(i)
        ZALR  *,AR3         ; ACC = bk(i)*2**16 + 2**15
        MPYA  *-,AR2        ; ACC = bk(i)*2**16
                            ;       + errf*x(i-k) + 2**15
                            ; PREG = errf*x(i-k+1)
        SACH  *+            ; SAVE bk(i+1).


Note that the processing order has been slightly changed to 
incorporate the use of the MPYA instruction. This is due to 
the fact that the accumulation performed by the MPYA is 
the accumulation of the previous product. 
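The benefit of adding 2**15 before storing the high-order accumulator bits can be seen in a short Python sketch. The two helper names are invented for this illustration; they model only the arithmetic shown in the listing comments, not the instructions themselves.

```python
def store_high_truncated(acc):
    """Keep the upper 16 bits of a 32-bit accumulator value."""
    return (acc >> 16) & 0xFFFF


def store_high_rounded(acc):
    """Add 2**15 first, so the upper 16 bits are rounded to nearest."""
    return ((acc + (1 << 15)) >> 16) & 0xFFFF


acc = 0x00018000                      # 1.5 in units of 2**16
print(store_high_truncated(acc))      # 1
print(store_high_rounded(acc))        # 2
```

Truncation always discards the low-order half, which biases every update in the same direction; pre-loading the rounding bit removes that bias at no extra cost when it is folded into the load of the coefficient.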

We have now seen the basic code for a FIR filter tap and
a coefficient update. Figure 7 shows a routine to filter a
signal and update the coefficients for a 256-tap adaptive
FIR filter. Note that for each tap one instruction cycle is
needed to perform the FIR filter (i.e., to execute a MACD),
three instruction cycles are needed to update the filter
coefficients, and 33 instruction cycles are needed for overhead.
Therefore, the total number of execution cycles needed for
the routine is 33 + 4n, where n is the filter length. Also,
note that data memory and program memory requirements
are 5 + 2n and 30 + 3n words, respectively. For adaptive
filters, the filter length is restricted by both execution time
and memory. There is obviously more processing to be completed
per sample due to the adaptation, and the adaptation



        TITL  'ADAPTIVE FILTER'
        DEF   ADPFIR
        DEF   X,Y
*
* THIS 256-TAP ADAPTIVE FIR FILTER USES ON-CHIP MEMORY BLOCK
* B0 FOR COEFFICIENTS AND BLOCK B1 FOR DATA SAMPLES. THE
* NEWEST INPUT SHOULD BE IN MEMORY LOCATION X WHEN CALLED.
* THE OUTPUT WILL BE IN MEMORY LOCATION Y WHEN RETURNED.
* ASSUME THAT THE DATA PAGE IS 0 WHEN THE ROUTINE IS CALLED.
*
COEFFP  EQU   >FF00         ; B0 PROGRAM MEMORY ADDRESS
COEFFD  EQU   >0200         ; B0 DATA MEMORY ADDRESS
ONE     EQU   >7A           ; CONSTANT ONE (DP=0)
BETA    EQU   >7B           ; ADAPTATION CONSTANT (DP=0)
ERR     EQU   >7C           ; SIGNAL ERROR (DP=0)
ERRF    EQU   >7D           ; ERROR FUNCTION (DP=0)
Y       EQU   >7E           ; FILTER OUTPUT (DP=0)
X       EQU   >7F           ; NEWEST DATA SAMPLE (DP=0)
FRSTAP  EQU   >0300         ; NEXT NEWEST DATA SAMPLE
LASTAP  EQU   >03FF         ; OLDEST DATA SAMPLE
*
* FINITE IMPULSE RESPONSE (FIR) FILTER.
*
ADPFIR  CNFP                ; CONFIGURE B0 AS PROGRAM:
        MPYK  0             ; Clear the P register.
        LAC   ONE,14        ; Load output rounding bit.
        LARP  AR3
        LRLK  AR3,LASTAP    ; Point to the oldest sample.
FIR     RPTK  255
        MACD  COEFFP,*-     ; 256-tap FIR filter.
        CNFD                ; CONFIGURE B0 AS DATA:
        APAC
        SACH  Y,1           ; Store the filter output.
        NEG
        ADD   X,15          ; Add the newest input.
        SACH  ERR,1         ; err(i) = x(i) - y(i)
*
* LMS ADAPTATION OF FILTER COEFFICIENTS.
*
        LT    ERR
        MPY   BETA
        PAC                 ; errf(i) = beta * err(i)
        ADD   ONE,14        ; ROUND THE RESULT.
        SACH  ERRF,1
        MAR   *+
        LAC   X             ; INCLUDE NEWEST SAMPLE.
        SACL  *
        LRLK  AR2,COEFFD    ; POINT TO THE COEFFICIENTS.
        LRLK  AR3,LASTAP    ; POINT TO THE DATA SAMPLES.
        LT    ERRF
*
        MPY   *-,AR2        ; P = 2*beta*err(i)*x(i-255)
ADAPT   ZALR  *,AR3         ; LOAD ACCH WITH b255(i) & ROUND.
        MPYA  *-,AR2        ; b255(i+1) = b255(i) + P
*                           ; P = 2*beta*err(i)*x(i-254)
        SACH  *+            ; STORE b255(i+1).
        ZALR  *,AR3         ; LOAD ACCH WITH b254(i) & ROUND.
        MPYA  *-,AR2        ; b254(i+1) = b254(i) + P
*                           ; P = 2*beta*err(i)*x(i-253)
        SACH  *+            ; STORE b254(i+1).
        ZALR  *,AR3         ; LOAD ACCH WITH b253(i) & ROUND.
        MPYA  *-,AR2        ; b253(i+1) = b253(i) + P
*                           ; P = 2*beta*err(i)*x(i-252)
        SACH  *+            ; STORE b253(i+1).
*       ...
        ZALR  *,AR3         ; LOAD ACCH WITH b1(i) & ROUND.
        MPYA  *-,AR2        ; b1(i+1) = b1(i) + P
*                           ; P = 2*beta*err(i)*x(i-0)
        SACH  *+            ; STORE b1(i+1).
        ZALR  *,AR3         ; LOAD ACCH WITH b0(i) & ROUND.
        APAC  *-,AR2        ; b0(i+1) = b0(i) + P
        SACH  *+            ; STORE b0(i+1).
        RET                 ; RETURN TO CALLING ROUTINE.

Figure 7. 256-tap adaptive FIR filter routine.
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itself dictates that the coefficients be stored in the ^con¬ 
figurable block of on-chip RAM. Thus, an adaptive filter 
with no external data memory is limited to 256 taps. 
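
As a rough check on these limits, the cycle and memory formulas quoted for the routine can be evaluated directly. The Python sketch below is illustrative only and is not part of the original routine: the function name adaptive_fir_budget is hypothetical, the 100-ns instruction cycle is the figure quoted for the TMS320C25, and the data-memory term is read as 5 + 2n.

```python
def adaptive_fir_budget(n, cycle_ns=100.0):
    """Evaluate the cycle and memory formulas quoted for the adaptive FIR routine.

    n        -- filter length in taps
    cycle_ns -- instruction cycle time in nanoseconds (100 ns assumed)
    """
    cycles = 33 + 4 * n                  # instruction cycles per input sample
    time_us = cycles * cycle_ns / 1000   # execution time per sample, microseconds
    return {
        "cycles": cycles,
        "time_us": time_us,
        "max_sample_rate_khz": 1000.0 / time_us,
        "data_words": 5 + 2 * n,         # assumed reading of the data-memory formula
        "program_words": 30 + 3 * n,
    }


budget = adaptive_fir_budget(256)
print(budget)
```

For n = 256 the sketch gives 1057 cycles per sample, or a maximum sample rate of roughly 9.5 kHz, and 517 data words, which fits within the 544 words of on-chip RAM.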

Telecommunications applications. Digital signal processing
will be more extensively used in telecommunications as it
evolves toward all-digital networks. 17 Below, we discuss
several typical uses of the TMS320C25 in telecommunications
applications.

Echo cancellation. In echo cancellation, an adaptive FIR
filter performs the modeling routine and signal modifications
needed to adaptively cancel the echo caused by impedance
mismatches in telephone transmission lines. The
TMS320C25’s large on-chip RAM of 544 words and on-chip
ROM of 4K words allow it to execute a 256-tap adaptive
filter (32-ms echo cancellation) without external data or
program memory.
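
The adaptive filtering behind echo cancellation can be sketched in a few lines of Python. This is a floating-point illustration, not the fixed-point TMS320C25 code: the coefficient update b_k(i+1) = b_k(i) + 2*beta*err(i)*x(i-k) follows the comments in Figure 7, while the function name lms_echo_canceller, the four-tap echo path, and the test signal are all hypothetical.

```python
import random


def lms_echo_canceller(far_end, line_signal, num_taps, beta):
    """Adapt an FIR model of the echo path and return (coefficients, residual)."""
    b = [0.0] * num_taps        # filter coefficients b_0 .. b_(N-1)
    delay = [0.0] * num_taps    # delay line, newest sample first
    residual = []
    for x, d in zip(far_end, line_signal):
        delay = [x] + delay[:-1]                        # time-shift the samples
        y = sum(bk * xk for bk, xk in zip(b, delay))    # echo estimate (FIR output)
        err = d - y                                     # echo-cancelled signal
        # LMS update: b_k(i+1) = b_k(i) + 2*beta*err(i)*x(i-k)
        b = [bk + 2.0 * beta * err * xk for bk, xk in zip(b, delay)]
        residual.append(err)
    return b, residual


random.seed(1)
echo_path = [0.5, -0.3, 0.2, 0.1]    # hypothetical echo impulse response
far_end = [random.uniform(-1.0, 1.0) for _ in range(2000)]
line_signal = [
    sum(h * far_end[i - k] for k, h in enumerate(echo_path) if i - k >= 0)
    for i in range(len(far_end))
]
coeffs, residual = lms_echo_canceller(far_end, line_signal, num_taps=4, beta=0.05)
print([round(c, 3) for c in coeffs])
```

In this noise-free setting the coefficients should approach the echo path and the residual should decay toward zero; a real canceller would also have to cope with near-end speech and finite word length.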

High-speed modems. For high-speed modems, the
TMS320C25 can perform functions such as modulation
and demodulation, adaptive equalization, and echo
cancellation. 18,19

Voice coding. Voice-coding techniques such as full-duplex,
32,000-bit-per-second adaptive differential pulse-code
modulation (CCITT G.721), continuously variable slope
delta modulation (CVSD), 16,000-bit-per-second subband
coding, and linear predictive coding are frequently used in
voice transmission and storage. The speed of the
TMS320C25 in performing arithmetic and its normalization
and bit-manipulation capabilities enable it to implement
these functions, usually on a single chip (i.e., with no
external devices).

Graphics and image processing applications. In these
applications, a signal processor’s ability to interface with a
host processor is important. The TMS320C25 multiprocessor
interface enables it to be used in a variety of
host/coprocessor configurations. Graphics and image
processing applications can use the TMS320C25’s large directly
addressable external data space and global memory capability
to allow graphical images in memory to be shared with a
host processor, thus minimizing data transfers. The
TMS320C25’s indexed indirect addressing modes allow
matrices to be processed row by row when matrix
multiplication is performed for 3-D image rotation,
translation, and scaling.

High-speed control applications. These applications use
the TMS320C25’s general-purpose features for bit-test and
logical operations, timing synchronization, and fast data
transfers (10 million 16-bit words per second). They use the
TMS320C25 in closed-loop systems for control signal
conditioning, filtering, high-speed computing, and multichannel
multiplexing. The following examples demonstrate typical
control applications.

Disk control. In disk drives, a closed-loop actuation
mechanism positions the read/write heads over the disk
surface. Accurate positioning requires various signal
conditioning tasks to be performed. The TMS320C25 can
replace costly bit-slice, custom, and analog solutions in
performing such tasks as compensation, filtering, and
fine/coarse tuning.

Robotics. The TMS320C25’s digital signal processing and
bit-manipulation power, coupled with its host interface,
make it useful in robotics control. The TMS320C25
can replace both the digital controllers and the analog signal
processing hardware a robot needs to communicate with a
central host processor, and it can perform the numerically
intensive control functions typical of robotic applications.

Instrumentation. Instruments such as spectrum analyzers
often require a large data memory space and a processor
capable of performing long-length FFTs and generating
high-precision functions with minimal external hardware.
The TMS320C25 fulfills these requirements.

Numeric processing applications. Numeric and array
processing applications benefit from the TMS320C25’s
performance. The device’s high throughput and its
multiprocessing and data memory expansion capabilities make it
a low-cost, easy-to-use replacement for a typical bit-slice
array processor.

Benchmarks. The TMS320C25 has demonstrated
impressive performance on benchmarks representing common
DSP routines and applications. Table 3 shows this
performance.


The TMS320C25 digital signal processor is the newest
member of the TMS320 family. It is a pin-compatible,
CMOS version of the TMS32020 but offers several
enhancements of that device—a 100-ns instruction cycle
time, 4K words of on-chip masked ROM, eight auxiliary
registers, an eight-level hardware stack, and a
double-buffered serial port. It also enhances the TMS32020
instruction set to support adaptive filtering,
extended-precision arithmetic, bit-reversed addressing, and
faster I/O.

The TMS320C25’s multiprocessor capability, large
memory spaces, and general-purpose features allow it to be
used in a variety of systems, including ones currently
employing costly bit-slice processors or custom ICs.

Table 3.
TMS320C25 benchmarks.

DSP ROUTINES/APPLICATIONS                PERFORMANCE

FIR filter tap                           100 ns per tap
256-tap FIR filter sample rate           37 kHz
LMS adaptive FIR filter tap              400 ns per tap
256-tap adaptive FIR filter sample rate  9.5 kHz
Biquad filter element                    1 µs
Echo canceller                           32 ms per single chip
                                         (with internal memory)
32,000-bit/s CCITT ADPCM                 1 channel full-duplex,
                                         single-chip (with
                                         internal memory)
16,000-bit/s subband coding              2 channels full-duplex,
                                         single-chip (with 0.5K
                                         external data memory)
2400-bit/s LPC-10 coding                 2 channels full-duplex,
                                         single-chip (with 2K
                                         external data memory)

The Motorola DSP56000
Digital Signal Processor
The DSP56000 brings 10.25-MIPS 
performance to digital signal 
processing and retains enough 
similarities to other Motorola 
microprocessors to make it easy to 
learn and program. 

Kevin L. Kloker 
Motorola, Inc. 


The Motorola DSP56000 is a high-performance,
user-programmable digital signal processor implemented
in high-density, low-power CMOS technology. The
first member of a new family of special-purpose
microprocessors designed specifically for digital signal
processing applications, 1,2 it has a highly parallel
architecture and can execute 10.25 million instructions per
second. Its instruction set supports the real-time processing
of 24-bit data with 56 bits of internal arithmetic precision.
Combining a core processor, RAM, ROM, peripheral
interfaces, and a memory expansion interface on a single,
88-pin chip, the DSP56000 represents the state of the art in
digital signal processor design.

Though the DSP56000 has hardware and software
features specifically added to support digital signal
processing, it shares enough similarities with other Motorola
microprocessors and single-chip microcomputers that it can
be used as a very fast microprocessor in any application that
requires high-speed calculations or real-time response. For
example, it can be used as a controller in applications
previously served by bit-slice processors. Providing both
single-chip and expanded bus operation, it also fits many
applications beyond the capabilities of 8- and 16-bit
microcontrollers.

The digital signal processing 
environment 

Digital signal processing is concerned with the real-time
processing of digitized analog signals, which are discrete in
both amplitude and time. 3 To illustrate the environment of
a typical DSP system, I will consider a simple digital
filtering application (Figure 1). The system shown is intended
to filter an analog signal using digital means. The system
starts with an analog input signal x(t), which is converted to
a sampled digital signal x(n) by an analog-to-digital, or A/D,
converter. As long as the system samples the analog input at
a frequency fs that is at least twice the information
bandwidth of that input, all information present in the original
analog signal is contained in the digital signal. No signal
information is lost. However, the conversion to a
discrete-amplitude signal introduces quantization noise
into the system. The quality of the sampled digital
signal—that is, its signal-to-quantization-noise ratio, or
SQNR—is a function of the A/D converter’s accuracy.
Choosing the resolution (number of bits) and linearity of the
A/D converter is a trade-off between cost and the SQNR
requirements of the system. 4

Once the system has obtained a faithful digital
representation of the analog signal, it can filter that
representation in the digital signal processor. The digital
signal processor stores the current A/D sample and N - 1
previous samples in a sample shift register. The DSP56000,
however, stores this data in a RAM and simulates the shift
register function by modifying memory address pointers.
The set of N filter coefficients h(i), i = 0, 1, ..., N - 1,
must also be stored in a RAM or ROM; these coefficients
determine the impulse response and the filter
characteristics. A larger N gives a longer impulse response and
generally produces filters with sharper roll-off, greater
stopband attenuation, and less frequency ripple. It does so
at the expense of more calculations per output sample and
more storage for sample data and coefficients. Since the
order N of the filter is finite and no feedback path exists to
sustain a nonzero filter output given a zero input, the filter
is called an Nth order, finite impulse response, or FIR,
digital filter. The FIR filtering operation requires N
multiplies and N - 1 additions to compute an output y(n)
each time the input signal is sampled. This operation is the
kernel of digital signal processing algorithms. The
DSP56000 contains a parallel hardware multiply/accumulate
circuit that completes one multiply/accumulate in 97.5 ns.
Repeated use of the multiply/accumulate operation
produces a sum-of-products result, which is the FIR filter
output y(n). This output is converted to an analog signal
y(t) by a digital-to-analog, or D/A, converter. The analog
output y(t) is now a filtered version of the analog input x(t).
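
The sum-of-products kernel and the pointer-based "shift register" described above can be sketched in Python as follows. This is a floating-point illustration under stated assumptions, not DSP56000 code: the class name CircularFir and the three-tap coefficient set are hypothetical, and an index that wraps around the buffer stands in for the address-pointer updates.

```python
class CircularFir:
    """N-tap FIR filter whose delay line is a circular buffer."""

    def __init__(self, coefficients):
        self.h = list(coefficients)     # filter coefficients h(0) .. h(N-1)
        self.n = len(self.h)
        self.samples = [0.0] * self.n   # sample storage (a RAM in the article)
        self.newest = 0                 # index of the most recent sample

    def step(self, x):
        """Accept one input sample x(n) and return the output y(n)."""
        # Overwrite the oldest sample instead of shifting the whole buffer.
        self.newest = (self.newest - 1) % self.n
        self.samples[self.newest] = x
        acc = 0.0
        for i in range(self.n):         # N multiply/accumulate operations
            acc += self.h[i] * self.samples[(self.newest + i) % self.n]
        return acc


fir = CircularFir([0.25, 0.5, 0.25])
output = [fir.step(x) for x in [1.0, 0.0, 0.0, 0.0]]
print(output)
```

Feeding a unit impulse reproduces the coefficients in order, followed by zeros once the impulse has left the N-sample window, which is the finite impulse response the name refers to.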

The operation of the system shown in Figure 1 is
straightforward. A sample is taken, a filter calculation is
performed, and a filter output is sent to the D/A converter.
The data samples are then time-shifted (delayed) by one
sampling period and the operations are repeated in the next
sampling period. A large amount of data must be stored,
processed, and time-shifted during each sampling period. In
practice, digital signal processing systems can be much more
complex than this simple example. However, the example
shows what characterizes every digital signal processing
environment—multiply/accumulates, time shifts, and
input/output in real time.

DSP advantages. The primary advantage of digital signal
processing is that it transforms analog functions into digital
hardware. It extends the advantages of digital
circuitry—high density, precision, stability, and
testability—to those parts of systems previously served by
analog components. Digital signal processing also
transforms analog functions in hardware into digital signal
processing algorithms. Traditional analog functions such as
filters are replaced by their digital equivalents in software
form. Software is inherently flexible and thus is better suited
to complex systems. Finally, digital signal processing
implements applications that analog signal processing
cannot. It makes possible digital speech transmission,
storage, and synthesis for communications systems, for
example. 5

DSP limitations. DSP users cite the inability of existing
devices to provide high processing speed and large memory
sizes as one of the most frequent system limitations. Some
DSP applications involve sampling rates of up to 100 MHz
and can require hundreds of millions of
multiply/accumulates per second. Another problem exists as
well. Because DSP is a new discipline, engineers are not
familiar with it and require training and experience in it. To
apply digital signal processing techniques, they must rethink
traditional product development approaches.


Figure 1. A simple digital filtering system.

DSP56000 features

The DSP56000 incorporates many new features designed
to overcome the limitations of earlier DSP devices. It
advances the state of the art in three areas—speed,
precision, and integration.

Speed. Invariably, DSP users identify processor speed and
system performance in a real-time, I/O-intensive
environment as the most important factors in their designs.
The speed with which a digital signal processor can execute
DSP algorithms is critical to the realization of real-time
operation. The DSP56000 achieves 10.25 MIPS and 10.25
million multiplies and accumulates per second, record highs
for its class of device. It executes many DSP benchmarks
two to four times faster than other recently announced DSP
devices. 6-8 To achieve this performance, the DSP56000’s
designers developed a unique multiplier/accumulator ALU
and a multiple-bus architecture. They also designed the
instruction set and interrupt structure to minimize software
overhead in real-time systems.

Precision. The DSP56000 provides the greater data
precision needed for advanced digital signal processing
applications. Users often ask for more than 16-bit data,
since they often employ double-precision arithmetic, data
scaling, and overflow checks to maintain 16-bit precision.
Yet many users do not need the dynamic range of 32-bit
floating-point arithmetic, since the majority of data
converters they use are 16 bits or less. Hence, the
DSP56000’s designers selected a 24-bit fixed-point data
format, which provides 24-bit precision without sacrificing
system speed or silicon. The 24-bit data word provides
144 dB of external dynamic range, sufficient for most
real-world applications. The designers also chose 56-bit
accumulators internal to the data ALU to provide 336 dB of
internal dynamic range. With 56-bit accumulation, no
precision is lost during intermediate calculations.
Twenty-four-bit precision provides eight bits of margin
against overflow, round-off, and truncation errors when
16-bit data are being processed and eliminates the need for
extra processing steps to maintain maximum precision. The
larger data size maximizes speed by allowing all
calculations to be performed in single-precision fashion.
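
The dynamic-range figures follow from the word length. As a worked check, the Python sketch below compares the exact value 20*log10(2^bits) with the common rule of thumb of 6 dB per bit; the function names are hypothetical, and the assumption here is that the quoted 144-dB and 336-dB figures use the 6-dB-per-bit approximation.

```python
import math


def dynamic_range_db(bits):
    """Exact dynamic range of a fixed-point word of the given length, in dB."""
    return 20.0 * math.log10(2.0 ** bits)


def dynamic_range_rule_of_thumb_db(bits):
    """Approximation of 6 dB per bit (assumed to be what the quoted figures use)."""
    return 6 * bits


for bits in (16, 24, 56):
    print(bits, round(dynamic_range_db(bits), 1), dynamic_range_rule_of_thumb_db(bits))
```

Under that assumption, 24 bits gives 144 dB and 56 bits gives 336 dB, while a 16-bit word gives 96 dB.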

Integration. A typical DSP system such as that shown in
Figure 1 contains two independent data memories (one for
data samples and one for coefficients), a separate program
memory, and a parallel or serial interface to peripheral
devices such as A/D and D/A converters. Anticipating such
system needs, the DSP56000 includes five on-chip
memories, three on-chip peripherals, and a full-speed
memory expansion port. Putting ample memory and
peripheral resources on the chip maximizes speed and
minimizes system chip count. By integrating the elements of
a DSP system in a single CMOS chip, the DSP56000
provides a cost-effective solution to DSP users. Many DSP
systems incorporate host microprocessors and additional
DSP ICs, memory, and peripherals. By providing both
parallel and serial input/output capability in an 88-pin
package, the DSP56000 supports such expanded systems.


DSP56000 architecture 

A block diagram of the DSP56000 architecture is shown
in Figure 2. The architecture consists of a core processor
that executes the DSP56000 instruction set and other
on-chip resources such as memory and peripherals that
provide storage and I/O capability to the core processor.
On-chip memory and peripherals are not considered part of
the core and may vary from one device family member to
another. The chip’s pins interface both the core processor
and on-chip peripherals to external devices.

Core processor. The core processor consists of three
separate execution units—the data ALU, the address ALU,
and the program controller—connected by multiple buses.
These units operate in a parallel rather than in a pipelined
fashion; i.e., each execution unit works on the same
instruction at the same time. This is in contrast to heavily
pipelined processors, which work on a large number of
different instructions at the same time. 9,10 Working in
parallel to minimize latency, the three execution units
provide all the resources the DSP56000 needs to execute
instructions in a single instruction cycle. Here, I define an
instruction cycle as two clock cycles; thus, an instruction
cycle is 97.5 ns with a 20.5-MHz processor clock. Each
execution unit is itself single-cycle and nonpipelined. The
architecture of each execution unit is different, being
optimized to support its role in instruction execution. Each
unit contains a set of registers, arithmetic elements, and
executable operations. These operations are
register-oriented rather than memory-oriented. Each
execution unit operates on its own local registers—source
operands are read from registers within the execution unit
and modified by arithmetic element operations, and the
results are stored in registers within the same execution
unit. Data transfers between execution units or between
execution units and memory occur in parallel with internal
execution unit operations. The single-cycle,
register-oriented execution units are key to DSP56000
performance because they do not impose pipelining latency
restrictions on the user.

Programming model 

The DSP56000 user programming model is shown in
Figure 3. It provides numerous register resources that are
partitioned into the three execution units of the processor.
The instruction set is designed to allow flexible control of
these parallel processing resources. Many instructions allow
the programmer to keep each execution unit busy, thus
enhancing program execution speed. The user can easily
program the DSP56000 from the programming model
without needing to refer to a complicated hardware block
diagram. Providing this capability to the user is a significant
achievement, since typical DSP IC programming resembles
writing microcode for a bit-slice architecture. With a typical
DSP IC, the user is often burdened with difficult
programming resulting from the timing complexities of
pipelined data paths. The DSP56000 takes a different
approach. Because of its nonpipelined execution units,
timing complexity is hidden from the user. No microcode is
required and a conventional programming model,
instruction set, and addressing mode definition can be used.
Thus, DSP56000 programming resembles assembly
language programming. The regularity of the programming
model also makes possible high-level-language
programming of the DSP56000.

Figure 2. DSP56000 block diagram.

The status register (SR) format is shown in Figure 4. The 
condition code register (CCR) portion indicates the results 
of operations on data operands. The mode register (MR) 
portion contains information about the system state of the 
processor. 


The instruction set defines three separate memory spaces,
x data, y data, and p program, which are each 65,536
locations by 24 bits wide. The total addressing capability is
196,608 24-bit words, or 589,824 bytes. All three memory
spaces can be accessed in parallel in the same instruction
cycle. The memory maps of these spaces are shown in
Figure 5 for the normal expanded operating mode. (One
obtains other memory maps by changing the operating
mode register, OMR.) Noncore resources (memory and
peripherals) are memory-mapped into these memory spaces
to provide a clean hardware and software interface with the
core processor.

Data ALU

The data ALU execution unit performs arithmetic and
logical operations on data operands. It consists of 10 local
registers, two shifter/limiter blocks, an accumulator shifter,
and a multifunction multiplier/accumulator (MAC) ALU
that has two 56-bit inputs and one 56-bit output. 11 The two
56-bit inputs can be used for operations such as addition,
subtraction, and comparison on 24-, 48-, or 56-bit numbers.
During multiply and multiply/accumulate operations, one
of the 56-bit inputs serves as the accumulator input while
the other 56-bit input is reconfigured as a 24-bit
multiplicand and 24-bit multiplier input. The MAC ALU
contains a 24 x 24-bit parallel hardware
multiplier/accumulator circuit that provides 56-bit
accumulation. The MAC ALU is not pipelined and performs
all operations in a single 97.5-ns instruction cycle. This is in
contrast to common two-stage pipelined architectures
employing a multiplier separated from an adder by a
product pipeline register. The product pipeline register adds
an extra cycle of delay before the multiplier output is
available. Since the MAC ALU has no product pipeline
register, it exhibits no delay and executes many algorithms
faster.

Figure 6. Bit weight and alignment of fractional data.

The data ALU has four 24-bit general-purpose input
registers and six special output registers organized as two
56-bit (8 + 24 + 24-bit) accumulators. The input registers can
be read or written as 24- or 48-bit data. The two 56-bit
accumulator registers can be accessed in numerous ways as
24-, 48-, or 56-bit data. The MAC ALU output is always
stored in an accumulator register. Data ALU operations
support several different levels of precision, as indicated in
Figure 6. Word operands are 24 bits long, long word
operands are 48 bits long, and accumulator operands are 56
bits long. The MAC ALU supports the signed,
two’s-complement fractional data representation commonly
used in digital signal processing algorithms. Fractional data,
d, has a range of -1.0 ≤ d < +1.0 and is identical to signed
integer data except for the placement of the binary radix
point. In practice, fractional and integer arithmetic are
equivalent for all operations except multiply and divide. An
integer multiplication product can be formed by a one-bit
right shift of a fractional product, for example.
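
The relationship between fractional and integer products can be checked with ordinary integer arithmetic. The Python sketch below is illustrative only: it models 24-bit words as Python integers, the helper names are hypothetical, and it ignores the single overflow case of -1.0 times -1.0.

```python
WORD_BITS = 24


def to_fraction(word):
    """Interpret a signed 24-bit integer as a two's-complement fraction."""
    return word / float(1 << (WORD_BITS - 1))


def integer_product(a, b):
    """Product of two 24-bit words treated as signed integers."""
    return a * b


def fractional_product(a, b):
    """48-bit fractional product: the integer product shifted left one bit,
    so that the radix point again follows the sign bit."""
    return (a * b) << 1


half = 1 << 22                            # 0x400000 represents +0.5
p_frac = fractional_product(half, half)   # 48-bit result with 47 fractional bits
p_int = integer_product(half, half)
print(to_fraction(half), p_frac / float(1 << 47))
```

Here 0.5 times 0.5 yields a 48-bit pattern representing 0.25, and shifting that fractional product right by one bit recovers the integer product.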

The data ALU provides the standard microprocessor set
of logical and shifting operations to support a variety of
algorithms. Logical operations are performed in the MAC
ALU. A 56-bit accumulator shifter is included on one of the
56-bit MAC ALU inputs for one-bit left or right shifts. The
data ALU does not contain a barrel shifter, but the MAC
ALU can be used for multibit shifting operations. By
multiplying the 24-bit data by a constant or variable, the
MAC ALU can perform left or right shifts of from 1 to 23
bits in a single instruction. The left-shifted result is available
in the least significant portion of the accumulator and the
right-shifted result is available in the most significant
portion of the accumulator. This method is efficiently used
by the instruction set to provide data normalization and
denormalization as well as bit-field insertion and extraction.
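
Shifting by multiplication can be modeled with integers. In the Python sketch below, a 24-bit word is multiplied by a power-of-two constant and the result is read from the upper or lower 24 bits of the 48-bit fractional product. The helper names and the choice of constants are assumptions made for illustration; they are not taken from the DSP56000 instruction set.

```python
WORD_MASK = 0xFFFFFF


def fractional_product(x, c):
    """48-bit fractional product of two signed 24-bit words."""
    return (x * c) << 1


def shift_right(x, k):
    """Arithmetic right shift by k (1..23) via multiplication by 2**-k.
    The result is read from the most significant 24 bits of the product."""
    constant = 1 << (23 - k)          # the fraction 2**-k as a 24-bit word
    return fractional_product(x, constant) >> 24


def shift_left(x, k):
    """Left shift by k (1..23) via multiplication by 2**(k-1) as an integer.
    The result is read from the least significant 24 bits of the product."""
    constant = 1 << (k - 1)
    return fractional_product(x, constant) & WORD_MASK


print(hex(shift_right(0x400000, 4)), hex(shift_left(0x000123, 8)))
```

In this model the right shift matches Python's arithmetic shift for signed values, and the left shift matches an ordinary shift truncated to 24 bits.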


Accumulator extension. The 56-bit accumulators store a
complete 48-bit multiplication product plus eight bits of
integer data called an accumulator extension. The extension
portion of the 56-bit accumulator provides eight bits of
protection against overflow during intermediate
calculations and allows at least 256 repetitive
multiply/accumulate operations before an overflow can
occur. The extension bits eliminate the need to scale down
the input data to avoid accumulator overflows arising from
word growth in repetitive calculations. This is true because
the fractional product, p, has a range of -1.0 ≤ p < +1.0
and the 56-bit accumulator, a, has a range of
-256.0 ≤ a < +256.0.

When word or long word data are written to an
accumulator register, the programmer may sign-extend its
extension portion and zero its least significant portion to
form a valid 56-bit signed number. When 56-bit
accumulator data are read out of the data ALU, the
programmer may postprocess the data by enabling two
special shifter/limiter circuits located between the
accumulator registers and the data ALU outputs. In
operation, a copy of the accumulator data is shifted left or
right one bit if it has been enabled by the scaling mode bits
in the status register. This allows block floating-point
algorithms to be implemented for fast Fourier transforms
and matrix manipulation. After shifting, the data are passed
through an overflow protection circuit called a limiter. If the
accumulator data can be stored in the destination without
overflow, the limiter is disabled and the data are not
modified. If the accumulator data cannot be stored in the
destination without overflow (because the extension portion
of the accumulator is in use), the limiter is enabled and
substitutes the maximum data value (having the same sign)
that the destination can store without overflow. This
technique, called saturation arithmetic, minimizes the
overflow error by avoiding the sign change usually
associated with binary overflow.
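
The shifter/limiter behavior described above can be sketched as follows. This Python model is an illustration under stated assumptions: the accumulator is a signed integer with 47 fractional bits (8 + 24 + 24 bits), the word is obtained by simple truncation, and the function and constant names are hypothetical rather than taken from the DSP56000 documentation.

```python
MAX_WORD = 0x7FFFFF      # largest positive 24-bit value (just under +1.0)
MIN_WORD = -0x800000     # most negative 24-bit value (-1.0)


def read_accumulator_word(acc, scale_shift=0):
    """Return the 24-bit word read from a 56-bit accumulator value.

    acc         -- signed integer with 47 fractional bits
    scale_shift -- +1 to shift left one bit, -1 to shift right one bit, 0 for none
    """
    if scale_shift > 0:
        acc <<= 1                    # scaling: shift a copy left one bit
    elif scale_shift < 0:
        acc >>= 1                    # scaling: shift a copy right one bit
    word = acc >> 24                 # drop the least significant 24 bits
    if word > MAX_WORD:              # extension bits in use, positive value
        return MAX_WORD
    if word < MIN_WORD:              # extension bits in use, negative value
        return MIN_WORD
    return word                      # fits: passed through unmodified


ONE = 1 << 47                        # accumulator representation of +1.0
print(hex(read_accumulator_word(3 * ONE // 2)), hex(read_accumulator_word(ONE // 2)))
```

An accumulator holding +1.5 reads back as the largest positive word rather than wrapping to a negative value, which is the sign-preserving property of saturation arithmetic.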


Rounding. For many DSP algorithms, single-precision
(24-bit) data ALU results are needed to minimize data
storage requirements or to provide data for subsequent use
as a multiplier input. Rounding the least significant portion
of the accumulator into the most significant portion is
better than truncation for maintaining maximum precision
and avoiding introduction of a negative bias error. The
DSP56000 provides “round to nearest even,” or RN,
rounding, which is performed during multiply and round
(MPYR), multiply/accumulate and round (MACR), and
round (RND) instructions.
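
Round-to-nearest-even can be illustrated with integers. The Python sketch below rounds away the least significant 24 bits of a value and resolves exact ties toward the even result; it is a generic model of the rounding rule, with a hypothetical function name, and does not claim to reproduce the exact bit-level behavior of the MPYR, MACR, or RND instructions.

```python
def round_to_nearest_even(acc, frac_bits=24):
    """Round away the low frac_bits of acc; exact ties go to the even result."""
    half = 1 << (frac_bits - 1)
    upper = acc >> frac_bits               # portion that is kept (floor)
    lower = acc & ((1 << frac_bits) - 1)   # portion that is rounded away
    if lower > half or (lower == half and (upper & 1)):
        upper += 1                         # round up; ties only when upper is odd
    return upper


HALF = 1 << 23
print(round_to_nearest_even((2 << 24) | HALF), round_to_nearest_even((3 << 24) | HALF))
```

A value exactly halfway between 2 and 3 rounds to 2, while one halfway between 3 and 4 rounds to 4, so tie cases do not accumulate a bias in one direction.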


Table 1.
DSP56000 addressing modes.

                                     Address    Assembler
Addressing mode                      modifier   syntax

Register direct                      No         Any register name

Address register indirect
  No update                          Yes        (Rn)
  Postincrement by 1                 Yes        (Rn)+
  Postdecrement by 1                 Yes        (Rn)-
  Postincrement by offset Nn         Yes        (Rn)+Nn
  Postdecrement by offset Nn         Yes        (Rn)-Nn
  Predecrement by 1                  Yes        -(Rn)
  Index by offset Nn                 Yes        (Rn+Nn)
    (Rn and Nn are unchanged)

Special
  Immediate data (24-bit)            No         #expr
  Absolute address (16-bit)          No         expr
  Immediate short data (8-, 12-bit)  No         #expr
  Short jump address (12-bit)        No         expr
  Absolute short address (6-bit)     No         expr
  I/O short address (6-bit)          No         expr

n = register number 0 to 7
expr = any valid assembler expression


Address ALU 

The address ALU execution unit calculates addresses to
locate data operands in memory. Two multiply/accumulate
input operands are typically required by the data ALU at
each instruction cycle. Data ALU results must be stored at a
less frequent rate. The address ALU provides a flexible
addressing capability through a large address register set,
14 addressing modes, and three types of address arithmetic.
It consists of twenty-four 16-bit address registers, two 16-bit
address arithmetic units, and an address output multiplexer.
The address ALU can provide two independent memory
addresses at each instruction cycle and update them with
two address register indirect addressing modes. A summary
of the addressing modes and their assembler syntax is given
in Table 1.

The 24 address registers are organized into three sets of
eight registers. The eight address registers Rn (n = 0 to 7)
are used as address pointers to locate data operands in
memory. The eight offset registers Nn are used as optional
offset values to update the address registers. The offset
registers may contain signed or unsigned data. The eight
modifier registers Mn select the type of address arithmetic
to be performed when an address register Rn is to be
updated. As shown in Table 2, the contents of the modifier
registers are encoded to select the type of address
arithmetic—linear, modulo, or reverse carry. The type of
address arithmetic defines the type of data structure (array,
sample shift register, or queue) being accessed in memory.
Each address register Rn is assigned an offset register Nn
and a modifier register Mn having the same register number
for use in address calculations. For example, the address
calculation (R0) + N0 postincrements the contents of
address register R0 by the contents of offset register N0
using the type of address arithmetic specified by the
contents of modifier register M0.

Table 2.
Address modifiers.

Modifier register   Address update arithmetic
Mn value            for address register Rn

0                   Reverse-carry (bit-reversed)
1                   Modulo 2
2                   Modulo 3
...                 Modulo (Mn + 1)
32766               Modulo 32767
32767               Modulo 32768
...                 Reserved
65535               Linear (modulo 65536)
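
The way a modifier value changes the result of a postincrement such as (R0) + N0 can be sketched in Python. The model below covers only the linear and modulo cases and rests on several assumptions: the function name update_address is hypothetical, the step is assumed to be no larger than the modulo size, the address is assumed to lie inside a valid modulo region, and the lower boundary is taken as the nearest multiple of the smallest power of two that is at least the modulo size, following the alignment rule the article gives for modulo regions.

```python
def update_address(r, n, m):
    """Return the new Rn after a postincrement by n under modifier value m.

    Only the linear (m == 0xFFFF) and modulo (1 <= m <= 32767) cases are
    modeled; the step is assumed to satisfy |n| <= m + 1.
    """
    if m == 0xFFFF:                     # linear: ordinary 16-bit addition
        return (r + n) & 0xFFFF
    if 1 <= m <= 32767:                 # modulo (m + 1)
        size = m + 1
        block = 1 << m.bit_length()     # smallest power of two >= size
        lower = r & ~(block - 1)        # lower boundary of the region holding r
        return lower + (r - lower + n) % size
    raise ValueError("modifier value not modeled in this sketch")


r0 = 75
for _ in range(3):
    r0 = update_address(r0, 5, 19)      # modulo 20, region 64..83
    print(r0)
```

Starting from address 75 with a modulo size of 20, successive postincrements by 5 stay within the region from 64 to 83 and wrap around instead of running past the upper boundary.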

The address ALU output multiplexer allows any address
register Rn to be used as a pointer to any memory space.
This must be done so that pointers will not be duplicated in
multiple registers when multiple memory spaces are
accessed. For example, complex data pairs are typically
stored in two data memory spaces (real part in x memory
and imaginary part in y memory) at the same address.
Efficient access to complex data pairs requires each pointer
to be able to access both the x and y data memory spaces.

Address modifiers. During an address calculation, a set
of three registers Rn, Nn, and Mn are accessed by the
appropriate address arithmetic unit. The contents of the
selected modifier register are decoded by the address
arithmetic unit so it can determine the type of address
arithmetic it should perform on the selected address pointer
Rn. To understand the role of the address modifiers in
creating data structures in memory, consider the examples
of eight-bit address arithmetic shown in Figure 7.


The linear address modifier example is identical to con¬ 
ventional microprocessor address calculations, in which ad¬ 
dress updating is performed by a conventional adder. The 
example shows the postincrement by the offset Nn address¬ 
ing mode, where the offset register contains the value 5. 
Linear addressing is most useful for addressing arrays of 
data. 

The reverse-carry address modifier example is performed
through the propagation of the adder carry in the reverse
direction, i.e., from the most significant bit to the least
significant bit of the adder. A characteristic of typical fast
Fourier transform algorithms is that the data and coefficients
may be stored in a nonsequential order called bit-reversed
order. Reverse-carry addressing can calculate bit-reversed
addresses for sequential access of FFT data and
coefficients. For a 2^k-point FFT, a postincrement by 2^(k-1)
using reverse-carry address arithmetic will generate the
bit-reversed address sequence. The example shown generates
the bit-reversed address sequence for a 16-point FFT buffer
starting at address 64. Reverse-carry addressing is equivalent
to doing bit-reversed addressing with simpler hardware.
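
The reverse-carry example can be reproduced in software. The
following Python sketch is an illustration only: it emulates a
reversed carry chain by mirroring the operands, adding
normally, and mirroring the sum back. The function names and
the 8-bit default width are assumptions made to match the
Figure 7 example, not part of the DSP56000 itself.

```python
def bit_reverse(value, width):
    """Return `value` with its lowest `width` bits mirrored."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def reverse_carry_add(address, offset, width=8):
    """Add `offset` to `address` with the carry running from MSB to LSB.

    Mirroring both operands, adding with an ordinary adder, and
    mirroring the sum gives the same result as a reversed carry chain.
    """
    mask = (1 << width) - 1
    total = (bit_reverse(address, width) + bit_reverse(offset, width)) & mask
    return bit_reverse(total, width)


def reverse_carry_sequence(start, offset, count, width=8):
    """List `count` addresses produced by repeated postincrement."""
    addresses = [start]
    for _ in range(count - 1):
        addresses.append(reverse_carry_add(addresses[-1], offset, width))
    return addresses


# 16-point FFT buffer at address 64, offset 2^(4-1) = 8.
sequence = reverse_carry_sequence(64, 8, 16)
print(sequence[:4])
```

The first four addresses match the figure (64, 72, 68, 76), and
the full sequence visits every location from 64 to 79 exactly
once.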

The modulo address modifier example creates a circular
(modulo) address region in memory with a lower boundary
and an upper boundary. Modulo arithmetic keeps the address
register pointing to a location within the modulo
region by automatic wraparound if the pointer increments
or decrements out of the modulo region. The modulo size
(i.e., the length of the modulo region) is specified by the
contents of the modifier register plus one. (The size is 20 in
the example.) The modulo size can be any number from 2
to 32768. The lower-boundary address and upper-boundary
address of the modulo region need not be directly specified,
since the modulo size implicitly defines all possible modulo
region boundaries. The lower-boundary address must have
as many least significant zeroes as the modifier value has
significant bits (five bits for Mn = 19). For example, a
modulo region of size 20 (Mn = 19) can have a lower-boundary
address at any integer multiple of 32 (0, 32, 64,
96, 128, 160, and so on). The upper-boundary address is the
lower-boundary address plus the modulo size minus one.
One of the possible modulo regions is selected implicitly by
loading the address register Rn pointing to a location within
a valid modulo region. (R0 = 75 selects the modulo region
from 64 to 83 in the example.) Modulo address arithmetic
simulates a shift register in memory by simply updating an
address pointer and eliminates the need to move data to
perform time-shift functions. Modulo address modifiers can
create FIFO queues, delay lines, and sample shift registers
in memory. They are also useful for interpolating and
decimating filters and for generating periodic waveforms.
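
The boundary rule and the wraparound can be checked with a
short Python sketch. It is a software model under stated
assumptions, not a description of the hardware: the function
names are hypothetical, and the wraparound is computed with an
ordinary remainder rather than the chip's address ALU logic.

```python
def modulo_region(address, modifier):
    """Return (lower, upper) bounds of the region selected by `address`.

    The lower boundary has as many zero low-order bits as the
    modifier value has significant bits.
    """
    size = modifier + 1
    boundary_bits = modifier.bit_length()
    lower = address & ~((1 << boundary_bits) - 1)
    upper = lower + size - 1
    if not lower <= address <= upper:
        raise ValueError("address does not lie inside a valid modulo region")
    return lower, upper


def modulo_postincrement(address, offset, modifier):
    """Advance `address` by `offset`, wrapping inside its modulo region."""
    lower, _ = modulo_region(address, modifier)
    size = modifier + 1
    return lower + (address - lower + offset) % size


# Figure 7 example: M0 = 19 (modulo 20), N0 = 5, R0 = 75.
pointer = 75
for _ in range(3):
    pointer = modulo_postincrement(pointer, 5, 19)
    print(pointer)
```

With these inputs the pointer steps through 80, 65, and 70, as
in the figure, and never leaves the region from 64 to 83.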

Linear address modifier

M0 = 255 = 1111 1111 for linear addressing with R0.
Original registers:          N0 = 5, R0 = 75 = 0100 1011
Postincrement by offset N0:  R0 = 80 = 0101 0000
Postincrement by offset N0:  R0 = 85 = 0101 0101
Postincrement by offset N0:  R0 = 90 = 0101 1010

Reverse-carry address modifier

M0 = 0 = 0000 0000 for reverse-carry addressing with R0.
Original registers:          N0 = 8, R0 = 64 = 0100 0000
Postincrement by offset N0:  R0 = 72 = 0100 1000
Postincrement by offset N0:  R0 = 68 = 0100 0100
Postincrement by offset N0:  R0 = 76 = 0100 1100

Modulo address modifier

M0 = 19 = 0001 0011 for modulo 20 addressing with R0.
Original registers:          N0 = 5, R0 = 75 = 0100 1011
Postincrement by offset N0:  R0 = 80 = 0101 0000
Postincrement by offset N0:  R0 = 65 = 0100 0001
Postincrement by offset N0:  R0 = 70 = 0100 0110

Figure 7. DSP56000 address arithmetic (8-bit examples).

Program controller

The program controller execution unit performs instruction
flow control, instruction decoding, and exception processing.
It consists of a program address generator, or PAG,
an instruction decoder, an interrupt controller, and a bus
controller. The PAG is nonpipelined and calculates a new
instruction address every instruction cycle. Instruction
prefetching is used to form a two-stage instruction pipeline
having the basic timing shown in Table 3. One instruction
cycle is used for the fetch operation, one for the decode
operation, and one for the execute operation. During normal
instruction execution, the PAG is responsible for fetching
the instruction word two locations ahead of the currently
executing instruction. The instruction fetch and decode
operations overlap with instruction execution and take no
execution time, with the exception of change-of-flow
instructions and possibly PAG bus accesses. Change-of-flow
instructions must fetch two instruction words to refill the
instruction pipe. The PAG may not have immediate access to
on-chip program memory if the currently executing instruction
is accessing a data operand in on-chip program
memory. The PAG may also introduce wait states during
external instruction fetches because of slow off-chip program
memory or because the currently executing instruction is
using the memory expansion port. During a resource conflict,
instruction fetches have lower priority than data accesses,
since the currently executing instruction has the
highest priority for use of chip resources.

Hardware DO loops. Many DSP programs spend 90 percent
of the time executing in 10 percent of the program
code. Digital filtering routines usually consist of a small
code kernel that is executed many times. The DSP56000
provides hardware DO loop control to replace the software
“decrement counter and branch” instruction normally
associated with DO loops. Straight-line coding is not needed
to maximize speed since compact, looped code runs at the
same speed as straight-line code. When executing DO loops,
the DSP56000 eliminates the usual change-of-flow overhead
by modifying the PAG instruction fetch sequence. It initiates
a hardware DO loop by executing the DO instruction.
The hardware DO loop mechanism stores the loop count
(LC), loop starting address, and loop ending address (LA)
in special registers and processes them in parallel with the
executing program. Inside the DO loop, the instruction
fetch address is compared to the loop ending address. When
the end of the loop is detected, the loop count is tested for
one. If the loop count is not one, it is decremented and the
instruction word at the loop starting address is fetched. If
the loop count is one, the DO loop execution is complete
and normal sequential instruction fetches resume. The saving
and restoring of the previous hardware DO loop
registers on a stack makes it possible to nest hardware DO
loops with minimal overhead. A separate system stack is
used to minimize overhead during nested hardware DO
loops and multilevel interrupts. This hardware stack is 32
bits wide and 15 locations deep. The double width allows
two registers to be transferred to/from the system stack
every instruction cycle. One can extend the system stack to
any depth by moving the stack data to/from memory using
software stacking techniques.
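
The end-of-loop test described above can be modeled in a few
lines of Python. This is a behavioral sketch only: the function
name and the use of plain integers for addresses are
assumptions, one-word instructions are assumed throughout, and
the system stack used for nested loops is not modeled.

```python
def do_loop_fetch_trace(loop_start, loop_end, loop_count):
    """Return the instruction fetch addresses for one hardware DO loop.

    loop_start: address of the first instruction in the loop body
    loop_end:   loop ending address (LA)
    loop_count: initial loop count (LC), at least 1 in this model
    """
    if loop_count < 1:
        raise ValueError("this model requires a loop count of at least 1")
    if loop_end < loop_start:
        raise ValueError("loop end must not precede loop start")

    trace = []
    lc = loop_count
    address = loop_start
    while True:
        trace.append(address)
        if address == loop_end:
            # End of loop detected: test the loop count for one.
            if lc == 1:
                break              # loop complete, sequential fetches resume
            lc -= 1                # otherwise decrement and refetch the start
            address = loop_start
        else:
            address += 1
    return trace


print(do_loop_fetch_trace(0x10, 0x12, 3))
```

A three-instruction body executed three times yields nine
fetches, with no extra fetches between passes; this is the
sense in which looped code runs as fast as straight-line code.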

Normal instruction timing

Instruction cycle   1    2    3    4    5    6    7    8    9    10
Fetch               n3   n4   n5   n6   n7   n8   n9   n10  n11  n12
Decode              n2   n3   n4   n5   n6   n7   n8   n9   n10  n11
Execute             n1   n2   n3   n4   n5   n6   n7   n8   n9   n10

n = normal instruction word

Fast interrupt instruction timing

Instruction cycle   1    2    3    4    5    6    7    8    9    10
Fetch               n3   n4   iv1  iv2  n5   n6   n7   n8   n9   n10
Decode              n2   n3   n4   iv1  iv2  n5   n6   n7   n8   n9
Execute             n1   n2   n3   n4   iv1  iv2  n5   n6   n7   n8

n = normal instruction word
iv = interrupt vector instruction word (no change of flow)

Long interrupt instruction timing

Instruction cycle   1    2    3    4    5    6    7    8    9    10
Fetch               n3   n4   iv1  iv2  i1   i2   i3   n5   n6   n7
Decode              n2   n3   n4   JSR  --   i1   RTI  --   n5   n6
Execute             n1   n2   n3   n4   JSR  --   i1   RTI  --   n5

n = normal instruction word
iv = interrupt vector instruction word (change of flow)
JSR = jump to interrupt service routine
i = interrupt service routine instruction word
RTI = return from interrupt

Exception processing. Exception processing in a DSP
environment is primarily associated with the transfer of data
between processor memory or registers and a peripheral
device. When an interrupt occurs, a limited context switch
must be performed with minimum overhead. Saving all of
the machine state is too time consuming and usually not
necessary. The DSP56000 provides a sophisticated interrupt
structure which reduces the timing overhead associated with
servicing interrupts. Each peripheral device and external
interrupt pin may be programmed to one of three interrupt
priority levels (IPLs) so that time-critical interrupts are
always serviced first. When more than one interrupt is
pending within the same IPL, a fixed priority table
determines the secondary priority level within that IPL. Most
on-chip peripherals have separate interrupt vectors for each
interrupting condition so the cause of the interrupt will be
known before the interrupt service routine is entered. There
is no need to poll devices or test status bits to determine the
interrupting condition. An interrupt vector may be error-free
or not error-free. If it is error-free, no error conditions
are associated with the interrupt request, and a data transfer
can service the request immediately without checking error
flags. If the interrupt vector is not error-free, error
conditions are associated with the interrupt request, and the
interrupt service routine must first check the error flags. As
can be seen from Table 4, 18 out of 32 possible interrupt
vectors are used. The two external interrupt pins, IRQA and
IRQB, may be programmed as level-sensitive or negative-edge
triggered.

Table 4. Interrupt sources.

The timing of the interrupt controller is shown in Table 3.
First, the interrupt controller synchronizes and prioritizes
pending interrupts to determine the highest-priority,
unmasked interrupt request. Second, the instruction fetch
stream is temporarily redirected to fetch two interrupt
instruction words (iv1 and iv2) at the interrupt vector
addresses. The two interrupt instruction words at the interrupt
vector addresses are fetched into the instruction pipeline
without waiting for the current instruction to finish
execution. The program counter is held constant since the
interrupt controller provides the two interrupt vector addresses.
Finally, normal instruction fetches (n5, n6, and so on,
according to the program counter’s contents) resume
immediately after the two interrupt instruction words have
been fetched. The usual interrupt vector change-of-flow is
avoided and the extra cycles required to empty the instruction
pipeline and refill it are eliminated. However, if instruction
word n4 is the first word of a two-word instruction, the
execution of n4 is aborted and n4 is refetched in place of
n5. Since most instructions are one word long, this occurs
infrequently. The two interrupt instruction words iv1 and
iv2 are decoded and executed. They may be two single-word
instructions or one two-word instruction.
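
The fetch, decode, and execute rows of Table 3 follow directly
from the order in which instruction words enter the pipeline.
The Python sketch below is illustrative only; the function
name is hypothetical, and it assumes one-word instructions and
no wait states. It builds the three rows from a single list of
instruction words.

```python
def pipeline_rows(stream, cycles=10):
    """Return Table 3 style rows for a fetch/decode/execute pipeline.

    stream[0] is the word executing in cycle 1; each word is decoded
    one cycle before it executes and fetched two cycles before.
    """
    if len(stream) < cycles + 2:
        raise ValueError("stream is too short for the requested cycles")
    return {
        "Fetch": stream[2:cycles + 2],
        "Decode": stream[1:cycles + 1],
        "Execute": stream[0:cycles],
    }


normal = ["n%d" % i for i in range(1, 13)]

# A fast interrupt redirects two fetches to the vector words iv1 and
# iv2 after n4 has been fetched; normal fetches then resume with n5.
fast = normal[:4] + ["iv1", "iv2"] + normal[4:]

for name, row in pipeline_rows(fast).items():
    print(name.ljust(8), " ".join(word.ljust(4) for word in row))
```

Because the two vector words simply take two slots in the
stream, the only cost of a fast interrupt in this model is the
two cycles spent executing iv1 and iv2.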

Fast interrupts. If execution of the two interrupt instruction
words does not cause a change of flow, the interrupt
routine is called a fast interrupt. In a fast interrupt, normal
instruction execution continues without delay following the
execution of the two interrupt instruction words iv1 and iv2.
Fast interrupts do not save the machine state, so instructions
that modify the machine state should not be used. Fast
interrupts do not require a return from interrupt (RTI)
instruction, since no context switch is performed. Although
any non-change-of-flow instruction can be used, a special
move peripheral (MOVEP) instruction has been provided to
support fast interrupts with a memory-to-memory data
transfer between memory-mapped peripheral devices and
any memory space. Fast interrupts can process up to 1.7
million interrupts (or 10.25 million bytes) per second, yet
they consume only 33 percent of the total execution time.
This performance level rivals that of dedicated direct
memory access hardware. Fast interrupts are a software
alternative to DMA that provides several advantages.

Unlike DMA, fast interrupts can service peripheral devices
with the flexibility that only software can offer. Fast
interrupts can use any addressing mode with linear, modulo, or
reverse-carry address arithmetic. They can service on-chip as
well as off-chip peripherals and memory using minimal
hardware. Fast interrupts can also handle peripheral errors
by vectoring to different interrupt service routines and can
support termination conditions other than the traditional
word count.
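
The throughput figures quoted for fast interrupts (1.7 million
interrupts, 10.25 million bytes, 33 percent of execution time)
can be reproduced with simple arithmetic. The Python sketch
below rests on assumptions that the article does not state
explicitly: an instruction cycle of two clock cycles at 20.5
MHz, two one-word moves of 24 bits each per interrupt, and one
interrupt every six instruction cycles. The six-cycle spacing
in particular is chosen only because it reproduces the quoted
numbers.

```python
CLOCK_HZ = 20.5e6            # processor clock rate quoted in the article
CLOCKS_PER_INSTRUCTION = 2   # assumption: two clock cycles per instruction cycle
BYTES_PER_WORD = 3           # one 24-bit data word
WORDS_PER_INTERRUPT = 2      # assumption: iv1 and iv2 each move one word


def fast_interrupt_load(cycles_between_interrupts=6):
    """Return (interrupts/s, bytes/s, fraction of execution time used)."""
    instruction_rate = CLOCK_HZ / CLOCKS_PER_INSTRUCTION
    interrupts_per_second = instruction_rate / cycles_between_interrupts
    bytes_per_second = (interrupts_per_second
                        * WORDS_PER_INTERRUPT * BYTES_PER_WORD)
    # Each fast interrupt spends two instruction cycles on iv1 and iv2.
    busy_fraction = WORDS_PER_INTERRUPT / cycles_between_interrupts
    return interrupts_per_second, bytes_per_second, busy_fraction


rate, byte_rate, busy = fast_interrupt_load()
print("%.2f million interrupts/s" % (rate / 1e6))
print("%.2f million bytes/s" % (byte_rate / 1e6))
print("%.0f percent of execution time" % (busy * 100))
```

Under these assumptions the model gives roughly 1.71 million
interrupts and 10.25 million bytes per second at one third of
the execution time, consistent with the figures in the text.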
Interrupt vector    Error-free
starting address    status       Interrupting condition

$0000               -            Hardware RESET
$0002               No           Stack error
$0004               Yes          Trace
$0006               Yes          Software interrupt (SWI)
$0008               Yes          External interrupt IRQA
$000A               Yes          External interrupt IRQB
$000C               Yes          SSI receive data
$000E               No           SSI receive data with exception
$0010               Yes          SSI transmit data
$0012               No           SSI transmit data with exception
$0014               Yes          SCI receive data
$0016               No           SCI receive data with exception
$0018               Yes          SCI transmit data
$001A               Yes          SCI idle line
$001C               Yes          SCI timer
$001E               -            Reserved for hardware development
$0020               Yes          HOST receive data
$0022               Yes          HOST transmit data
$0024               Yes          HOST command (default)
$0026               Yes          Available for HOST command (AHC)
$0028               Yes          AHC
$003E               Yes          AHC


Long interrupts. If either of the two interrupt instruction
words is a jump to subroutine (JSR) instruction, the interrupt
routine is called a long interrupt. Programming a JSR
instruction at the interrupt vector addresses iv1 or iv2 causes
a long interrupt routine. When the JSR instruction is decoded,
the DSP56000 performs a context switch by saving
the current program counter and status register on the stack
and updating the interrupt mask in the status register. The
program counter is loaded with the JSR destination address
and the long interrupt routine (i1, i2, and so on) begins
execution. If desired, the programmer can save more of
the machine state by using software stacking operations.

The long interrupt routine is terminated by the usual RTI
instruction (located at i2 in this example). As shown in
Table 3, this mechanism allows long interrupts to be vectored
via the JSR destination address with a minimum of
timing overhead. This change of flow is as fast as conventional
vectored interrupts and is a natural extension of the
fast interrupt mechanism.


Table 5. DSP56000 instruction set.

Arithmetic instructions

ABS      Absolute value
ADC      Add with carry
ADD      Add
ADDL     Shift left then add
ADDR     Shift right then add
ASL      Arithmetic shift left
ASR      Arithmetic shift right
CLR      Clear
CMP      Compare
CMPM     Compare magnitude
DIV      Divide iteration
MAC      Multiply/accumulate
MACR     Multiply/accumulate and round
MPY      Multiply
MPYR     Multiply and round
NEG      Negate
NORM     Normalize iteration
RND      Round
SBC      Subtract with carry
SUB      Subtract
SUBL     Shift left then subtract
SUBR     Shift right then subtract
Tcc      Transfer conditionally
TFR      Transfer
TST      Test

Logical instructions

AND      Logical AND
ANDI     AND immediate control register
EOR      Logical exclusive OR
LSL      Logical shift left
LSR      Logical shift right
NOT      Complement
OR       Logical inclusive OR
ORI      OR immediate control register
ROL      Rotate left
ROR      Rotate right

Bit manipulation instructions

BCLR     Bit test and clear
BSET     Bit test and set
BCHG     Bit test and change
BTST     Bit test on memory
JCLR     Jump if bit clear
JSET     Jump if bit set
JSCLR    Jump to subroutine if bit clear
JSSET    Jump to subroutine if bit set

Program control instructions

Jcc      Jump conditionally
JMP      Jump
JScc     Jump to subroutine conditionally
JSR      Jump to subroutine
NOP      No operation
REP      Repeat next instruction
RESET    Reset on-chip peripheral devices
RTI      Return from interrupt
RTS      Return from subroutine
STOP     Stop processing
SWI      Software interrupt
WAIT     Wait for interrupt

Loop instructions

DO       Start hardware loop
ENDDO    Exit from hardware loop

Move instructions

LUA      Load updated address
MOVE     Move data
MOVEC    Move control register
MOVEM    Move program memory
MOVEP    Move peripheral data

The internal multiple-bus architecture of the DSP56000
consists of four data buses and three address buses connecting
the various resources on the chip. Support of the fast
multiplier/accumulator requires two multiplier input
operands to be transferred between memory and the data
ALU upon each instruction cycle. Instruction fetches must
also be done at the same rate. This translates to three data
transfers (two data and one instruction) every instruction
cycle. The bus structure supports general register-to-register,
register-to-memory, and memory-to-register data movement
and can transfer up to three 24-bit words at the same time.
The resources connected to a bus define its primary
data-transport function (see Figure 2 again). One address and
data bus is associated with each of the three memory spaces
(x data, y data, and p program), whereas a fourth data bus,
called the global data bus, is shared by all three memory
spaces. The role of the global data bus is to physically extend
the x, y, and p data buses so they can be connected to
remote chip resources while keeping these buses as short as
possible. Keeping the buses local enhances the speed of the
device.

Internal bus switch. Communication between the buses in
the multiple-bus structure takes place in an internal bus
switch. The internal bus switch is similar to a switch matrix
and can connect any two internal buses without adding any
pipeline delays. This flexibility allows a general data move
capability for easier programming. Since the internal data
bus switch can access each memory space, the bit manipulation
unit is physically located in this block.

Table 6. Parallel move operations.

Parallel move operation              Example of assembler syntax

No parallel move                     ADD X0,A
Register -> register                 ADD X0,A  Y1,R0
Address register update              ADD X0,A  (R1)-N1
Immediate short data -> reg.         ADD X0,A  #183,R4
Immediate data -> register           ADD X0,A  #$F97B4A,B
Immediate data -> register           ADD X0,A  #$123456,X1  B,Y1
  plus register -> register
Absolute short address <-> reg.      ADD X0,A  Y:$3A,R4
Absolute address <-> register        ADD X0,A  A,X:$FFE3
Absolute address <-> register        ADD X0,A  A,X1  Y0,Y:$3F80
  plus register -> register
x memory <-> register                ADD X0,A  X0,X:(R5)+
x memory <-> register                ADD X0,A  X:(R0)-,X0  A,Y0
  plus register -> register
y memory <-> register                ADD X0,A  Y:(R0)+,R7
y memory <-> register                ADD X0,A  B,X0  A,Y:(R0)+N0
  plus register -> register
x memory <-> register                ADD X0,A  X1,X:(R3)+  Y:(R6)-,B
  plus y memory <-> register
L long memory <-> register           ADD X0,A  AB,L:(R2)+

Here, “source,destination” assembler format is used and “ADD X0,A” is a sample opcode and operand. 


DSP56000 instruction set 

The DSP56000 has an easy-to-learn, microprocessor-style
instruction set that is efficient for many different algorithms.
Its instruction set has some characteristics of a
reduced-instruction-set computer because of its register-based
(load/store) orientation and because most of the instructions
are executed in a single cycle. However, enough
complex-instruction-set computer functionality is built into
each instruction that the DSP56000 achieves the high
performance needed for small code loops typically consisting of
only two to five instructions. Lack of data pipelining
contributes to the ease of programming and eliminates any
architectural bias against an algorithm. During development
of the DSP56000, its designers used a set of 15 common
DSP benchmarks to test the speed and coding efficiency of
the instruction set whenever they made changes to it. These
benchmarks included digital filters and fast Fourier transforms
for real and complex data. (The benchmarks are
listed in an appendix to the DSP56000 manual [12].) The
designers used other benchmarks to measure performance for
two-dimensional problems such as matrix manipulation and
image processing. Many algorithms achieve high performance
on the general-purpose DSP56000. A list of
DSP56000 opcodes is shown in Table 5.

In addition to the usual set of microprocessor opcodes,
the DSP56000 instruction set provides a powerful set of
multiply and multiply/accumulate instructions with options
for rounding and positive or negative product accumulation.
Other opcode additions include absolute value, shift
left or right then add or subtract, compare magnitude,
normalize, round, and transfer conditional instructions. When
used after a compare or compare magnitude instruction, the
transfer conditional (Tcc) instruction can perform maximum
value, minimum value, maximum absolute value,
minimum absolute value, and other functions.
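
The compare-then-transfer pattern can be illustrated with a
search for the maximum absolute value and its index. The Python
sketch below is a conceptual analogy, not DSP56000 code: the
comparison stands in for a compare magnitude instruction and
the guarded assignment for a conditional transfer. The function
name is hypothetical.

```python
def max_abs_with_index(samples):
    """Return (value, index) of the sample with the largest magnitude."""
    if not samples:
        raise ValueError("samples must not be empty")
    best_value = samples[0]
    best_index = 0
    for index, sample in enumerate(samples):
        # "Compare magnitude": is |sample| larger than the current best?
        if abs(sample) > abs(best_value):
            # "Transfer conditionally": keep the new value and its index.
            best_value, best_index = sample, index
    return best_value, best_index


print(max_abs_with_index([0.1, -0.7, 0.5, 0.7]))
```

Because the comparison is strict, the first of several equally
large samples is the one retained.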

Parallel move operations. Most arithmetic and logical
instructions can specify up to two data transfers in the same
instruction. These data transfers are called parallel move
operations and are executed in one instruction cycle in
parallel with the instruction opcode. This allows two or
three conventional instructions to be combined into one
parallel instruction, with corresponding gains in speed and
coding efficiency. Parallel move operations allow the
register-based execution units to be kept busy by concurrently
preloading new input operands and storing previous
results. They also provide concurrent communications between
execution units. Used with the local registers in each
execution unit, the parallel move operations provide “software
pipelining” controlled by the user. Thus, the user can
adapt the DSP56000’s parallel architecture to the application.
The parallel move operations are shown in Table 6.
The assembler syntax samples illustrate the different parallel
move operations that can be specified with the same ADD
X0,A instruction. Note that the same register may be a
source operand more than once in the same instruction.
However, no register may be specified as a destination
operand more than once in the same instruction.

The general instruction encoding format is shown in
Figure 8. All instructions are one or two 24-bit words in
length. The first word generally contains an 8-bit opcode
field and a 16-bit data bus movement field. The opcode
field includes the instruction opcode with its source and
destination register operands. The data bus movement field
specifies source and destination operands for parallel move
operations over the x data and y data bus. For address
register indirect addressing modes, the data bus movement
field specifies up to two address registers and associated
addressing modes. As shown earlier in Table 1, the DSP56000
provides a set of 14 addressing modes to minimize address
generation overhead. Both register direct and address
register indirect (pointer) addressing modes are available as
one-word instructions. Absolute short, I/O short, and short
jump addresses and immediate short data addressing
modes are also provided, being encoded into the data bus
movement field for fast, one-word instructions. Absolute
address and immediate data addressing modes are encoded
in the second optional effective address extension word to
provide full 16-bit addresses and 24-bit data.

Figure 8. DSP56000 instruction encoding format. (Fields
shown: data bus movement, opcode, and optional effective
address extension.)
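
The field widths given for the first instruction word (an
8-bit opcode field and a 16-bit data bus movement field in a
24-bit word) can be illustrated with a small packing routine.
The Python sketch below is hypothetical: the article gives only
the widths, so placing the data bus movement field in the upper
16 bits is an assumption based on the left-to-right order of
the fields in Figure 8, and the function names are invented for
the example.

```python
OPCODE_BITS = 8
MOVE_BITS = 16
WORD_BITS = OPCODE_BITS + MOVE_BITS   # 24-bit instruction word


def pack_first_word(data_bus_movement, opcode):
    """Combine the two fields into one 24-bit instruction word.

    Assumed layout: data bus movement in the upper 16 bits,
    opcode in the lower 8 bits.
    """
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode field must fit in 8 bits")
    if not 0 <= data_bus_movement < (1 << MOVE_BITS):
        raise ValueError("data bus movement field must fit in 16 bits")
    return (data_bus_movement << OPCODE_BITS) | opcode


def unpack_first_word(word):
    """Split a 24-bit word into (data_bus_movement, opcode)."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("instruction word must fit in 24 bits")
    return word >> OPCODE_BITS, word & ((1 << OPCODE_BITS) - 1)


word = pack_first_word(0x1234, 0x56)
print(hex(word), unpack_first_word(word))
```

The field values used here are arbitrary bit patterns, not
real DSP56000 encodings.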


On-chip resources 

The DSP56000 provides a large set of on-chip memory
and I/O peripheral resources to support the core processor.
The on-chip memories include two 256 x 24-bit data
RAMs, two 256 x 24-bit data ROMs, and one 2048 x
24-bit program ROM. Data or program code can be moved
from any memory space to another, whether on-chip or
off-chip. Microcomputer-style I/O capability is provided by
three on-chip peripherals on 24 programmable,
general-purpose port pins: the parallel host MPU/DMA
interface (host), the asynchronous serial communications
interface (SCI), and the synchronous serial interface (SSI).
Noncore resources are supported by standard move and bit
manipulation instructions. This avoids irregular I/O
instructions and will allow Motorola to easily change noncore
resources in future DSP56000 family members without
affecting the instruction set.

Figure 9. DSP56000 pinout.

Memory expansion port. The DSP56000 chip pinout is
shown in Figure 9. Memory expansion off-chip is provided
by a memory expansion interface called Port A. External
peripherals and slave processors (microprocessors or
another DSP IC) may also be accessed through this port.
Separate 16-bit address and 24-bit data buses are used to
multiplex the three internal address buses and four internal
data buses off-chip, respectively. Off-chip (external)
memory spaces are a logical extension of on-chip (internal)
memory spaces. The bus controller determines, from the
value of the address, whether a memory access is external or
internal and schedules the bus activity to optimally use the
memory expansion bus. If only one external memory access
is requested per instruction cycle, the request is granted
immediately and no extra clock cycles are required. If two or
three external memory accesses are requested in a given
instruction cycle, a minimum of one or two extra instruction
cycles, respectively, are required to complete the instruction.
The seven bus control signals create a synchronous bus that
can perform 10.25 million accesses per second. Full-speed
operation with no wait states requires a memory access time
of 55 ns. The bus controller can be programmed to insert 0
to 15 wait states for four types of external memory access.
Each wait state is one clock cycle (or 48.75 ns) long with a
20.5-MHz processor clock. This allows fast and slow external
devices to be mixed on the memory expansion bus. An
external device can also gain control of the memory expansion
bus by asserting the bus request (BR) control input. In
response to a BR, the DSP56000 releases control of the bus
and asserts the bus grant (BG) control output at the end of
the current bus access.
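
The bus scheduling and wait state rules stated above reduce to
simple arithmetic. The Python sketch below restates them; the
function names are hypothetical, the extra-cycle count is the
minimum given in the text, and the 48.75-ns clock period is the
figure quoted for a 20.5-MHz clock.

```python
CLOCK_PERIOD_NS = 48.75   # one wait state, as quoted for a 20.5-MHz clock
MAX_WAIT_STATES = 15


def extra_instruction_cycles(external_accesses):
    """Minimum extra instruction cycles for 0 to 3 external accesses.

    One external access per instruction cycle is free; each further
    access costs at least one extra instruction cycle.
    """
    if not 0 <= external_accesses <= 3:
        raise ValueError("an instruction makes at most three memory accesses")
    return max(0, external_accesses - 1)


def wait_state_delay_ns(wait_states):
    """Time added to one external access by the programmed wait states."""
    if not 0 <= wait_states <= MAX_WAIT_STATES:
        raise ValueError("wait states must be between 0 and 15")
    return wait_states * CLOCK_PERIOD_NS


print([extra_instruction_cycles(n) for n in range(4)])
print(wait_state_delay_ns(2), "ns added by two wait states")
```

For example, a slow device programmed for two wait states adds
97.5 ns to each access under these figures, while a fast device
on the same bus can still run with none.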

General-purpose I/O pins. On the DSP56000, 24 pins can 
be programmed to be general-purpose I/O pins; these pins 
are called Port B and Port C (see Figure 9 again). When 
they are programmed in this way, they can be used as I/O 
flags for synchronization and control purposes. Individual 
pin control, data direction, and data transfer functions are 
provided by six internal memory-mapped registers. The 
user can change or test these registers using standard bit 
manipulation and jump-on-bit-condition instructions. Each 
port pin can also be programmed to serve as a dedicated pin 
for one of the three on-chip peripherals. 

Host processor interface. Although the DSP56000 can
operate as a stand-alone processor, in many systems it is
accompanied by a host microprocessor. The host processor
functions as the system controller and user interface, while
the DSP56000 handles the real-time digital signal processing.
To support its use in such multiple-processor systems,
the DSP56000 includes an on-chip host processor interface.
The host interface is a byte-wide, full-duplex parallel port
that can be connected directly to the data bus of a host
processor. The host processor may be any of a number of
industry-standard microcomputers or microprocessors,
another DSP IC, or DMA hardware. The host interface
appears as a memory-mapped peripheral occupying eight bytes
in the host processor address space. It gives the host
processor an eight-bit bidirectional data bus and seven control
lines to use to control data transfers. It provides 14 internal
registers to support double-buffered data transfers between
the host processor and the core processor by means of
asynchronous polling or interrupts. In DMA mode, the host
interface allows an external DMA controller to perform DMA
transfers between an external memory or device and the
host interface’s registers using the host request (HREQ) and
host acknowledge (HACK) handshake lines. From the
DSP56000’s perspective, the DMA data can be transferred
between the host interface registers and any DSP56000
register or memory location (internal or external) by means
of the standard instruction set. The fast interrupt mechanism
can be used to minimize the data transfer overhead.


The host interface provides DMA initialization commands
that are used to set up the host interface DMA channel. It
also has a special host command feature that allows the
host processor to issue a vectored interrupt request to a
DSP56000 program. The host processor may select one of
32 DSP56000 host command interrupts by writing a vector
address register. Host commands are useful for debugging,
performing on-line diagnostics, implementing control
protocols, and setting up DMA.

Serial communications interface. The serial communications
interface, or SCI, provides full-duplex serial
communications with a variety of serial devices (including
microprocessors, other DSP ICs, terminals, and modems),
either directly or via RS-232 lines. The SCI supports
industry-standard asynchronous character modes and allows
parity and multidrop options. The multidrop option includes
wake-up-on-idle-line and wake-up-on-address-bit
capabilities. A synchronous shift register mode allows I/O
expansion and high-speed, synchronous data transmission.
The SCI consists of separate transmit and receive sections
and a programmable baud-rate generator. Seven internal
registers provide double-buffered data transfer and control
functions. Three I/O pins are used for transmit data,
receive data, and baud-rate clock functions. The baud rate
can be internally or externally generated for asynchronous
rates of up to 320K bits per second and synchronous rates
of up to 2.5M bits per second. The internal baud-rate
generator can function as a periodic interrupt timer when it is
not being used by the transmit and receive sections.

Table 7. DSP56000 benchmark summary.

Benchmark                                      Performance

N-tap real FIR filter with data shift          97.5 ns per tap
N-tap real, LMS adaptive FIR filter            292.5 ns per tap
  with data shift
N real, cascaded IIR biquad filters            390 ns per filter
  (four coefficients)
N-tap complex FIR filter with data shift       390 ns per tap
256-point complex FFT (radix 2, looped)        0.706 ms
1024-point complex FFT (radix 2, looped)       4.994 ms
Two-dimensional convolution                    975 ns per output
  (3 x 3 coefficient mask)
Finding of maximum absolute value              195 ns per point
  and index in array

Figure 10. DSP56000 finite impulse response (FIR)
filter: data memory organization (top) and program
(bottom).

Synchronous serial interface. The synchronous serial
interface, or SSI, provides a full-duplex, double-buffered
serial port that allows the DSP56000 to communicate with a
variety of serial devices. These include one or more
industry-standard codec A/D and D/A converters, other
DSP ICs, microprocessors, and serial peripheral devices.

The SSI consists of separate transmit and receive sections
and an SSI clock generator. The clock generator defines the
serial bit rate, the serial word size, and the number of serial
words per frame. The data in each serial frame are controlled
by software, allowing any user protocol to be implemented.
Several clock and frame sync timing options provide
flexible, synchronous, serial communications at rates
of up to 5M bits per second. The SSI uses three to six I/O
pins, depending on the operating mode; it has eight internal
registers.

Three SSI operating modes support the requirements of 
different serial devices. The normal operating mode is used 
for periodic devices that transmit or receive one data word 
with each serial frame. One time slot is defined for data 
transmission at the start of each serial frame. A codec A/D 
or D/A converter is an example of such a periodic device. 
The on-demand operating mode is for nonperiodic com¬ 
munications such as those from one DSP56000 to another. 
No time slots are defined for transmission and data are 
transmitted as soon as they are available. The network 
operating mode defines from 2 to 32 time slots per serial 
frame which can be used for creating a network of com¬ 
municating serial devices. Each device can transmit or 
receive during one or more assigned time slots. The time slot 
assignments for each serial device are determined by the 
user’s software. In network mode, multiple DSP56000s can 
communicate without needing glue chips to do so. In net¬ 
work mode, the DSP56000 can also be interfaced directly to 
the time-division-multiplexed, serial I/O channels used in 
telecommunications applications. 

DSP56000 implementation 

The DSP56000 is implemented in a 1.5-µm, double-level 
metal, N-well, high-density CMOS process. It is an 88-pin 
integrated circuit available in pin-grid-array or surface- 
mount packaging. Its maximum clock rate is currently 20.5 
MHz and is provided by an on-chip crystal oscillator or an 
external clock. One of the major design accomplishments 
was low power consumption in both active and standby 
modes. Because of the DSP56000’s full CMOS design, its 
power consumption scales down linearly with clock frequen¬ 
cy, allowing the user to reduce processor speed to save 
power. Since the DSP56000 contains substantial on-chip 
memory, the user can locate program and data sections with 
high dynamic frequency in on-chip memory to avoid driving 
the memory expansion port. The DSP56000’s bus controller 
does not toggle the external address and data bus pins 
unless an external access is made. The DSP56000 deselects 
all external devices to their low-power standby modes when 
they are not being accessed. It also saves power by using 
STOP and WAIT low-power standby modes. Execution of 
a STOP or WAIT instruction puts the DSP56000 into a 
low-power standby mode while the DSP system is off-line 
or waiting for an interrupt to occur, respectively. This 
“software power control” technique is popular in the 
MC146805 and MC68HC11 microcomputer families. 

Software examples and 
performance analysis 

The DSP56000’s performance of some common DSP 
benchmarks is summarized in Table 7. However, the perfor¬ 
mance and software features of the DSP56000 can best be 
demonstrated by several typical DSP software examples. 

Finite impulse response filter. The sample FIR digital 
filtering application of Figure 1 is called an Nth-order real 
filter, where N is the number of coefficients and “real” 
means that the data and coefficients are real numbers. The 
DSP56000 assembler source program and memory organization 
for this system are shown in Figure 10. The N data 
samples are stored in x data memory and the N filter 
coefficients are stored in y data memory. This natural 
partitioning is desirable since both x memory and y memory 
space can be accessed in the same instruction cycle. Both the 
data and coefficients are stored in modulo N buffers, but for 
different reasons. Modulo N data addressing is used to 
time-shift the simulated shift register by incrementing R0 
by 1 after the FIR calculation. Modulo N coefficient 
addressing is used for convenience: at each sampling period 
it automatically wraps the address pointer R4 around to 
the first coefficient. The four instructions at START are 
used to set up the two modulo N address pointers R0 and 
R4. The seven instructions at FIR form a simple loop to get 
an “input” sample, perform the FIR filtering operation, 
and store the filter “output.” The first data and coefficient 
are preloaded into the data ALU while the accumulator is 
cleared (CLR). The actual filtering operation is performed 
by repeating (REP) the multiply/accumulate (MAC) 
instruction. The last tap of the filter is performed by a 
multiply/accumulate and round (MACR) instruction to 
form a single-precision, rounded result. All calculations are 
performed to 56-bit precision and no intermediate data are 
lost. The accumulator extension register protects against 
overflows for large N. The FIR filtering kernel is performed 
in one instruction cycle per tap. A complete Nth-order, real 
FIR filter executes in N+3 instruction cycles. For a 32nd- 
order filter, a 20.5-MHz DSP56000 can process a sample 
frequency (fs) as high as 250 kHz in real time. 
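
The modulo N data addressing described above can be illustrated in Python rather than DSP56000 assembler. This is a minimal sketch, not the program of Figure 10: the names make_fir, state, and pos are hypothetical, floating-point arithmetic stands in for the 56-bit fixed-point accumulator, and the direction in which the pointer walks the buffer is an assumption.

```python
def make_fir(coeffs):
    """Return a per-sample FIR filter built on a circular (modulo-N) data buffer."""
    n = len(coeffs)
    state = [0.0] * n   # stands in for the N samples held in x data memory
    pos = 0             # stands in for the modulo-N data pointer (R0 in the text)

    def step(sample):
        nonlocal pos
        state[pos] = sample          # the newest sample overwrites the oldest one
        acc = 0.0                    # accumulator cleared before the taps
        idx = pos
        for c in coeffs:             # one multiply/accumulate per tap
            acc += c * state[idx]
            idx = (idx - 1) % n      # step back to the next older sample, modulo N
        pos = (pos + 1) % n          # time shift: move the pointer, not the data
        return acc

    return step
```

Calling `make_fir([0.5, 0.25, 0.25])` and feeding it a unit impulse returns the coefficients in order, which is the expected impulse response of an FIR filter. No sample is ever copied; only the pointer moves.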

Figure 11. DSP56000 infinite impulse response (IIR) filter: 
biquad IIR filter diagrams (top), data memory organization 
(middle), and program (bottom). 

Figure 12. DSP56000 fast Fourier transform (FFT): 
diagram of the “butterfly” kernel (top) and program 
(bottom). 

Infinite impulse response filter. An IIR filter employs 
feedback paths to achieve an infinite impulse response and 
generally provides the most filtering capability for a given 
computational load and storage. One of the most popular 
IIR filters is the second-order section, or real biquad, filter. 
The biquad filter has two poles and two zeroes in its transfer 
function and is the digital equivalent of a two-pole 
analog filter. The biquad filter is often connected in series 
(cascaded) to form higher-order digital filters. Figure 11 
shows the biquad filter’s block diagram, its data memory 
organization, and the DSP56000 assembler source program 
needed to implement N cascaded biquad digital filters. The 
memory is organized to store the filter storage w(n - 1) and 
w(n - 2) in x data memory and the filter coefficients a1, a2, 
b1, and b2 in y data memory. In this example, the time-shift 
function is performed by parallel move operations instead 
of by modulo addressing. The two instructions at START 
set up data pointer R0 and coefficient pointer R4 for linear 
arithmetic. The program loop at IIR loads a digital 
“input,” performs N cascaded biquad filters, and stores the 
filter “output.” The actual biquad filter program consists 
of only four multiply/accumulate instructions located 
within a hardware DO loop. The DO instruction initiates 
the hardware DO loop with the instruction following the 
DO instruction and ends it with the instruction before the 
label ENDMIR. The digital output is rounded (RND) to 
single precision after the hardware DO loop has executed. 
The four-coefficient, biquad filter kernel executes in only 
four instructions, i.e., in 390 ns per biquad filter. Similar 
five-coefficient biquad filter kernels execute in five 
instructions, or 487.5 ns. For typical voiceband processing with an 
8-kHz sampling rate, the DSP56000 can implement over 300 
biquad digital filters in real time. 
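
The four-coefficient biquad cascade can be sketched in Python as follows. This is an illustration under stated assumptions, not the program of Figure 11: the function name cascade_biquads is hypothetical, a direct-form-II structure is assumed, and the sign convention on a1 and a2 (subtracted in the feedback path) is an assumption, since the figure is not reproduced here.

```python
def cascade_biquads(x, sections, states):
    """Pass one input sample through cascaded direct-form-II biquad sections.

    sections: list of (a1, a2, b1, b2) tuples, one per biquad.
    states:   list of [w1, w2] per biquad, updated in place,
              where w1 = w(n-1) and w2 = w(n-2).
    """
    for (a1, a2, b1, b2), w in zip(sections, states):
        w0 = x - a1 * w[0] - a2 * w[1]   # feedback half: two multiply/accumulates
        x = w0 + b1 * w[0] + b2 * w[1]   # feedforward half: two more
        w[1], w[0] = w[0], w0            # time shift of the filter storage
    return x
```

Each section costs four multiply/accumulates, matching the four-instruction kernel described in the text. At 390 ns per section, an 8-kHz sample period of 125 µs leaves room for about 320 sections, which is consistent with the figure of over 300 filters quoted above.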

Fast Fourier transform. The fast Fourier transform, or 
FFT, is widely used in signal processing to perform spectral 
analysis of time-domain data. Most FFTs operate on com¬ 
plex data having real and imaginary components and hence 
must use complex arithmetic. The DSP56000 is very effi¬ 
cient at complex arithmetic. Complex arithmetic generally 
requires at least four multiplier input registers to store two 
complex inputs, two accumulators to store the real and im¬ 
aginary parts of the complex result, and address pointers 
that can access complex data pairs. A large number of FFT 
algorithms exist, but here we will demonstrate only the sim¬ 
ple, radix-2, decimation-in-time, in-place complex FFT. The 
heart of the radix-2 FFT is a complex “butterfly” kernel, 
which is shown in Figure 12. Note that all data paths and 
calculations are done in complex arithmetic. This butterfly 
kernel is executed many times to calculate a complete FFT. 
The complete DSP56000 macro program for a radix-2, 
decimation-in-time, in-place complex FFT is also shown in 
Figure 12. This macro may be called to perform an FFT of 
any size from 2 to 32,768 points. The example uses three 
nested hardware DO loops to compact the complete FFT 
program into only 40 words of program memory. The but¬ 
terfly kernel routine inside the innermost DO loop requires 
only six instructions, or 585 ns, to perform one butterfly 
calculation on two complex data points A and B. The single 
complex multiplication (B x C) is implemented as four real 
multiplies. The inner, middle, and outer DO loops build on 
the kernel routine to process one butterfly group, one 
butterfly pass, and the complete FFT, respectively. 
This radix-2 FFT using looped code executes a 256-point 
complex FFT in 814 µs and a 1024-point complex FFT in 
6.41 ms. Other variations achieve even higher performance. 
For instance, modified versions of the code in the example 
execute a 1024-point complex FFT in less than 5 ms. 

Since the FFT data are stored in 24-bit words, little or no 
scaling is required with typical 12-to-16-bit input data. The 
DSP56000 scaling modes can be used to scale the data of 
each FFT pass one bit left or right with minimum overhead. 
This can add 20 more bits of dynamic range to a 1024-point 
FFT while still maintaining full FFT speed. 
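
The three-loop structure of the radix-2, decimation-in-time, in-place FFT can be sketched in Python. This is a minimal illustration, not the macro of Figure 12: the name fft_dit is hypothetical, Python complex floats replace the 24-bit fixed-point data, twiddle factors are computed on the fly rather than read from a table, and the explicit bit-reversal step at the start is an assumption about input ordering.

```python
import cmath

def fft_dit(data):
    """In-place radix-2 decimation-in-time FFT; len(data) must be a power of two."""
    n = len(data)
    # Reorder the input into bit-reversed order so the output is in natural order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            data[i], data[j] = data[j], data[i]
    span = 1
    while span < n:                                  # outer loop: one pass each time
        for start in range(0, n, 2 * span):          # middle loop: one group each time
            for k in range(span):                    # inner loop: one butterfly each time
                c = cmath.exp(-1j * cmath.pi * k / span)   # twiddle factor C
                a = data[start + k]
                bc = data[start + k + span] * c      # B x C: four real multiplies
                data[start + k] = a + bc             # A' = A + BC
                data[start + k + span] = a - bc      # B' = A - BC
        span *= 2
    return data
```

The butterfly in the innermost loop is the kernel the text describes: one complex multiplication followed by a complex add and subtract, repeated for every pair of points in every pass.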

DSP56000 design tools 

The DSP56000 is supported by a software development 
package consisting of a full-featured macro cross assembler, 
ASM56000, a software simulator, SIM56000, and associated 
documentation. The package is available for the IBM 
PC/MS-DOS, VAX/VMS, and Unix environments. 
ASM56000 offers the usual complement of features found 
in modern assemblers, such as extensive error reporting, 
conditional assembly, file inclusion, nested macros with 
macro library support, local labels, sections, and external 
definition/reference directives. It also provides arbitrary 
expression evaluation with Boolean operators and built-in 
functions for data type conversion, string comparison, and 
common transcendental functions such as sine, cosine, 
logarithm, exponent, and square root. These functions allow 
constants and lookup tables for DSP algorithms to be 
parameterized by macro arguments and dynamically 
generated at assembly time. ASM56000 also provides 
assembler output listings with instruction cycle counts, 
cross-reference tables, and memory utilization reports. The 
memory utilization reports provide a global view of the 
allocated and free space in the DSP56000 memory map, 
which can become fragmented by inefficient placement of 
modulo and reverse-carry (bit-reversed) storage regions. A 
report identifies the free space available for additional 
storage in each memory map. 

The SIM56000 software simulator emulates, on a clock 
cycle basis, the functions of the DSP56000, including 
on-chip peripheral activity and external I/O pin activity. 
SIM56000 enables the software developer to execute 
DSP56000 object code generated from ASM56000 or from 
SIM56000’s own single-line assembler. Throughout the 
DSP56000 chip design, SIM56000 was checked against the 
chip data base simulations by running the same 
DSP56000 programs through both simulations. Both the 
chip design and the software products were debugged 
rapidly through this comparison testing. SIM56000 performs, on 
an instruction or clock cycle basis, single stepping or tracing 
with multilevel conditional or unconditional breakpoints. It 
provides instruction and clock cycle counts and generates 
histograms of those counts to support the analysis of 
program execution time. It displays the enabled set of registers 
and memory, highlights the write operations, and provides 
options for displaying upon a read, write, or other access. 
Of particular note are SIM56000’s input and output 
commands, which assign the simulator’s I/O for a device pin, 
memory location, or on-chip peripheral to a terminal or a 
disk file. Its I/O data format can be untimed or “time 
stamped,” and nested repeat directives can be used to 
generate arbitrary input data sequences. Besides being an 
accurate simulator of the DSP56000, SIM56000 is very easy 
to use. Help files describe each simulator command and a 
help line on the display indicates the command line syntax 
as commands are entered. A symbolic calculator assists the 
programmer with hexadecimal, decimal, and binary 
calculations. The programmer may define, store, and execute 
simulator command macros, patch programs with 
SIM56000’s single-line assembler/disassembler, log all 
simulator activity to disk files for future analysis, and save 
the simulator state so that the simulation can be resumed 
later. SIM56000 uses ASCII disk 
files for all of its I/O (but not for saving the simulator 
state), enhancing access by other programs and utilities. 

Motorola is developing a DSP56000 evaluation module, 
or EVM, to serve as a low-cost design tool. The EVM con¬ 
sists of an evaluation board, or EVB, an interface card for 
the IBM PC bus, and a user interface software package, 
EVM56000. The EVB contains a full-speed 20.5-MHz 
DSP56000, 8192 x 24 bits of external program/data RAM, 
an expansion connector for adding prototype hardware, and 
a monitor ROM. It is controlled by the EVM56000 software 
running in the PC environment, which presents a user inter¬ 
face similar to that of the SIM56000 software simulator. 
EVM56000 retains many of SIM56000’s features and adds 
commands to support up to eight EVBs on the same host 
with foreground/background access to the user interface. 
The EVB can be used as a hardware accelerator for 
speeding up simulations or as a prototype board for 
developing target applications. 

Additional software support includes software source 
libraries, application notes, and a high-level-language com¬ 
piler. Moreover, Motorola supports the user with training 
seminars, video classes, and an electronic bulletin board. 

The DSP56000 is a state-of-the-art, high-performance 
digital signal processor. The similarities between it 
and other Motorola microprocessors make it easy 
to learn and easy to program, yet the differences open up 
new user applications. The future for programmable digital 
signal processors looks extremely bright. Digital signal 
processors are transforming analog circuits into software in 
the 1980’s in the same way microprocessors transformed 
digital control logic into software in the 1970’s. 

Feature 


The ADSP-2100 DSP 

Microprocessor 

John P. Roesgen 
Concord Data Systems 



The 2100 accesses external memory efficiently and devotes its 
silicon area to providing greater functionality and processing throughput. 


The performance of single-chip digital signal processors 
has always lagged far behind the requirements of many 
application areas. There are several reasons for this 
discrepancy, some related to the limits of VLSI technology 
and others to architectural considerations. However, given 
the inherent ease of use and cost advantages that these 
devices offer, there is a strong desire to extend their 
application into high-performance areas. 1 

One key factor that affects DSP performance is the 
interface between the processor and its memory. To date, most 
DSP chips dedicate a large portion of their silicon area to 
on-chip memory. A design of this sort not only constrains 
the size of the memory but also confines the processing logic 
to a smaller area and reduces its functionality. Furthermore, 
it usually limits access to external memory and in some cases 
incurs a speed penalty. 

The Analog Devices ADSP-2100 microprocessor represents 
an alternative approach to digital signal processing 
architecture. Unlike several other single-chip DSPs, the 2100 
is designed to access external memory efficiently. With the 
exception of a small instruction cache, the chip itself 
contains no memory. The enormous amount of silicon real 
estate saved by excluding the memory is used to add 
functionality and increase processing throughput significantly 
beyond that of previous single-chip designs. 2 

The 2100 includes such items as a full-function barrel 
shifter for normalization and denormalization, two in¬ 
dependent data address generators with modulo addressing 
capability, a program sequencer with provisions for zero- 
overhead looping, a background register set for rapid con¬ 
text switching, and sufficient internal busing to support a 
high degree of parallelism in the instruction set. An ad¬ 
vanced 1.5-micrometer CMOS process gives the chip an in¬ 
struction cycle time of 125 nsec and a power consumption 
of less than 1/2 watt. 

System configuration 

Figure 1 shows a basic ADSP-2100 system configuration. 
The processor interfaces with two external memory systems, 
a program memory and a data memory. As the names sug¬ 
gest, program memory holds the application program and 
data memory holds the system data. Because they are 
separate, instructions and data can be accessed simultane¬ 
ously. Data can also be stored in the program memory, 
which also allows dual data access. 

A set of address, data, and control lines is provided for 
each memory. On the program memory side there are 14 ad¬ 
dress lines (PMA), 24 data lines (PMD), a memory select 
signal (PMS), read and write strobes (PMRD and PMWR), 
and a signal to indicate when data (as opposed to an in¬ 
struction) is being accessed (PMDA). The 14 address lines 
give an address range of 16K words, which can be expanded 
to 32K if the PMDA signal is used as an additional address 
bit. 

On the data memory side there are 14 address lines 
(DMA), 16 data lines (DMD), a memory select (DMS), read 
and write strobes (DMRD and DMWR), and a signal to 
acknowledge the transfer of data (DMACK). Peripheral 
devices are memory mapped into the data memory address 
space. Slower devices can stretch the memory cycle as 
needed by withholding the DMACK signal. 

The chip supports multiprocessing applications with bus- 
request and bus-grant signals (BR and BG). The 2100 re¬ 
sponds to a bus request by halting program execution and 
releasing the address, data, and control lines to the memo¬ 
ries so that another processor can access them directly. 

Four interrupt request (IRQ) inputs are provided for ex¬ 
ternal devices that need periodic service from the processor. 
The interrupt pins can be individually programmed for 
either level or edge sensitivity. The four inputs are priori¬ 
tized with options for nesting (higher priority levels inter¬ 
rupting lower ones) or blocking (only one level serviced at a 
time). The maximum response time for an unmasked inter¬ 
rupt request is two cycles. 

A high-level internal block diagram of the 2100 is shown 
in Figure 2. Three separate computational units are pro¬ 
vided, an ALU, a multiplier-accumulator (MAC), and a 
barrel shifter. Together, these offer a wide variety of fast 
arithmetic functions. Two independent data address units 
generate the external memory addresses that keep data flow¬ 
ing between computation and memory. The program se¬ 
quencer coupled with an on-chip cache memory maintains a 
continuous instruction stream to the rest of the processor. 
Five major buses speed the transfer of information between 
the various functional blocks. These buses provide the 
necessary paths to allow the execution of complex multi¬ 
function instructions in one machine cycle. Four of them 
extend off chip to become the address and data lines for the 
external memories. 


Computational features 

The computational section of the processor is divided in¬ 
to three independent units. Rather than being arranged in 
the usual series fashion, these units rest side by side, relying 
on the R bus as a flexible interconnect path. Operation of 
the R bus allows any sequence of arithmetic operations to 
be performed smoothly, without excessive juggling of inter¬ 
mediate results. 

The 16-bit-wide ALU performs general-purpose arith¬ 
metic and logical operations. The arithmetic functions in¬ 
clude add, subtract, negate, increment, decrement, absolute 
value, and divide. Provisions are included for both double¬ 
precision and saturation arithmetic. The available logic 
functions are AND, OR, Exclusive OR, and NOT. 

The MAC performs multiply, multiply-accumulate, and 
multiply-subtract operations. The 16 x 16-bit multiplier array 
produces a 32-bit product, which is fed into a 40-bit adder/ 
subtracter. The final, 40-bit-wide result leaves plenty of 
room for overflow. Multiplier inputs can be any combina¬ 
tion of signed or unsigned formats, making double-preci¬ 
sion multiplication possible. Options also exist for unbiased 
rounding and saturation of the final result. 
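
The accumulator headroom described above can be illustrated with a short Python sketch. It is a model under stated assumptions, not a description of the ADSP-2100 hardware: the function names mac and saturate32 are hypothetical, both inputs are treated as signed, and saturating the final result to 32 bits is an assumption about what "saturation of the final result" means here.

```python
def mac(acc, x, y):
    """Add the product of two signed 16-bit values to a 40-bit accumulator."""
    assert -(1 << 15) <= x < (1 << 15)
    assert -(1 << 15) <= y < (1 << 15)
    acc += x * y                     # a signed 16 x 16 product fits in 32 bits
    acc &= (1 << 40) - 1             # keep 40 bits, as a 40-bit adder would
    if acc >= 1 << 39:               # reinterpret as two's complement
        acc -= 1 << 40
    return acc


def saturate32(acc):
    """Clamp a 40-bit accumulator value to the signed 32-bit range (assumed width)."""
    return max(-(1 << 31), min((1 << 31) - 1, acc))
```

Because the accumulator is eight bits wider than a single product, a few hundred worst-case products can be summed before the 40-bit value itself wraps; the clamp is applied only once, to the final result.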

The shifter efficiently implements the numerical scaling 
operations needed for floating-point arithmetic. These 
operations include normalization, denormalization, shifting 
by a constant, and deriving an exponent for an individual 
number or block of numbers. The shifter array accepts a 
16-bit input and produces a 32-bit output. Both zero-filling 
and sign extension of the result are available. Multiprecision 
shifting operations are also fully supported. 
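
Deriving an exponent for a single number or for a block of numbers, and then normalizing, can be sketched in Python. The names redundant_sign_bits and block_normalize are hypothetical, and treating the exponent as a count of redundant sign bits is an assumption for illustration; the text does not specify the shifter's exact exponent format.

```python
def redundant_sign_bits(value, width=16):
    """Count how many bits below the sign bit merely repeat the sign."""
    if value < 0:
        value = ~value               # fold negative values onto non-negative ones
    return width - 1 - value.bit_length()


def block_normalize(block, width=16):
    """Shift every value left by the largest amount that is safe for the whole block."""
    shift = min(redundant_sign_bits(v, width) for v in block)
    return [v << shift for v in block], shift
```

The block case uses a single shared shift, set by the value with the fewest redundant sign bits, so that no element overflows the word width.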

Each of these arithmetic units contains a set of input and 
output registers, which act as stopover points for data as it 
moves between the external memory and the computational 
circuitry. The registers therefore introduce a level of pipelin¬ 
ing into the dataflow. The processor’s instruction set accom¬ 
modates this capability by allowing computations and regis¬ 
ter-memory transfers to be overlapped. Computational 
operations take their operands either from the local input 
registers or from an output register via the R bus and then 
load the result into a local output register. 

Register names are derived from their function to ease the 
programming task. For example, AX0 and AX1 are the 
ALU X input registers, AY0 and AY1 are the ALU Y input 
registers, and AR is the ALU result register. The equivalent 
registers in the MAC are MX0, MX1, MY0, MY1, and MR. 
The shifter’s register names are SI for the input register, SR 
for the result register, and SE for the exponent register. 

A complete set of background input and output registers 
can be activated at any time if the processor must change 
tasks quickly. This capability effectively doubles the number 
of available registers and can eliminate the save and restore 
overhead associated with context switching. So for example, 
execution of interrupt service routines that require the com¬ 
putational facilities of the processor can be sped up tremen¬ 
dously. By switching to the background register set, the pro¬ 
cessor can save its current computational state in one cycle. 
Switching back to the original registers will then restore the 
previous context at a later time. 

Figure 2. ADSP-2100 internal block diagram. 


Address generation 

Fast number-crunching hardware is of little benefit if it 
must frequently sit idle waiting for data. A powerful 
memory-addressing scheme prevents this situation and 
keeps memory references going at a rate equal to the pro¬ 
cessing rate. Should an operation (such as an add or multi¬ 
ply) require two operands, both must be supplied at this 
rate. 

The 2100 contains two independent address generators. 
Both supply data memory addresses, but one of them can 
also address the program memory, allowing access to 
“data” stored there as well. Thus the processor has the 
capability of fetching two operands simultaneously, one 
from data memory and one from program memory. One 
obvious application is digital filtering. By storing samples in 
one memory and coefficients in the other, the processor can 
access both in a single machine cycle and keep the 
multiplier-accumulator running at full speed. 

Each address generator contains the elements shown in 
Figure 3. Memory pointers are kept in the I (index) register 
file. The M (modify) register file contains incremental 
values, which move the pointers by a desired amount each 
time they are used. The L (length) registers define the size of 
each data structure being accessed. Each of the register files 
contains four 14-bit registers, which are loadable and 
readable via the internal DMD bus. Address generator 1 
also has a bit-reversing capability to aid in the scrambling or 
unscrambling of data in fast Fourier transforms. 
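
The index scrambling that bit reversal produces can be shown with a few lines of Python. This is only an illustration of the index mapping; the name bit_reverse is hypothetical, and how the ADSP-2100 applies the reversal to its 14-bit addresses is not described in the text.

```python
def bit_reverse(index, bits):
    """Reverse the low `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)   # move the lowest bit to the other end
        index >>= 1
    return result
```

For an 8-point FFT (3 index bits), the natural order 0..7 maps to 0, 4, 2, 6, 1, 5, 3, 7, and applying the mapping twice restores the original order.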

Whenever an indexed memory reference is made, a 
selected I register provides the address. An independently 
selected M register is then added to the address to form a 
tentative next address. The tentative next address feeds into 
the modulus logic along with the selected L register value. 
The modulus logic determines whether the new address is 
outside the bounds of its associated data structure. If it is, 
the address wraps around in a modulo fashion to remain 
within its allowable range. Otherwise, the address passes 
through the modulus logic unchanged. In either case the 
output of the modulus logic is loaded back into the original 
I register, ready for the next memory reference. The 
complete address modification process can be described with 
the following formula: 

    Next address = (I + M - B) modulo (L) + B 

where 

    I = Index register value 
    M = Modify register value 
    B = Base address 
    L = Length register value 

Notice that this computation requires the memory base 
address but that it is not supplied. This information is 
implied indirectly by adopting the following two rules: 

• If the buffer length requires n bits to be represented in 
binary, the lower n bits of the buffer base address must be 
zero, and 

• The modify value M should not be greater than the 
length L. 


With these restrictions, the buffer base address B can be ex¬ 
tracted from the I + M value by masking out the lower n bits 
and setting them to zero. 

As an example of modulo addressing, suppose that an I 
register points to the last location in a circular buffer of 
length 5. Then modifying the pointer with an M register 
containing +1 actually decrements the address by 4, back 
to the first location in the buffer. If the pointer is then 
modified by an M register containing -2, the address 
increments by 3 and points to the next-to-last location. Of 
course, if a pointer modification does not cross either 
boundary, it behaves in the usual fashion. 


Cache memory 

Fetching data from program memory would seem to be in 
conflict with the normal instruction fetches that keep the 
program going. One way for the processor to deal with this 
conflict is for it to insert an extra memory cycle for the data 


fetch. But this method negates the advantage of storing the 
data in program memory, since it could have just as easily 
been fetched from data memory during an extra cycle. 

Fortunately, most time-critical computations are repetitive 
in nature. The 2100 executes these computations in the form 
of program loops, and this is where the on-chip cache 
memory comes in. The job of the cache memory is to main¬ 
tain a small (16-word) history of previously executed in¬ 
structions. When the program enters a loop, the cache 
stores the loop instructions on the first pass, but on all 
subsequent passes it can feed the instruction register and 
free the program memory for data fetches without incurring 
extra cycles. Thus the processor’s performance under these 
conditions approaches that of a three-memory system. The 
processor maintains the cache memory and makes the extra 
cycle decision during execution, making them transparent to 
the user. 
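
The benefit of the cache can be estimated with a simple cycle-count model in Python. This is a rough sketch under stated assumptions, not a timing specification: the name loop_cycles is hypothetical, every loop instruction is assumed to read one data word from program memory, each conflict is assumed to cost exactly one extra cycle, and the cache is assumed to hold the 16 most recently executed instructions.

```python
CACHE_WORDS = 16  # cache size given in the text


def loop_cycles(loop_length, passes, cache_words=CACHE_WORDS):
    """Estimate cycles for a loop whose every instruction also reads program-memory data."""
    if passes == 0:
        return 0
    conflict_pass = 2 * loop_length   # instruction fetch plus one extra data-fetch cycle
    cached_pass = loop_length         # instructions are supplied by the cache
    if loop_length <= cache_words:
        # Only the first pass pays for the conflicts; later passes run from the cache.
        return conflict_pass + (passes - 1) * cached_pass
    # A loop longer than the cache never finds its instructions there (assumed).
    return passes * conflict_pass
```

Under these assumptions a four-instruction loop run 100 times takes 404 cycles instead of 800, so the one-time cost of filling the cache quickly becomes negligible.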


Program sequencer 

Keeping numerical throughput high also requires a 
sophisticated program sequencer, for if the processor gets 
bogged down by branching, looping, or responding to inter¬ 
rupts, the computation rate suffers. A large portion of the 
2100 chip area was dedicated to the program sequencer to 
streamline the program flow and minimize overhead. Figure 
4 shows a detailed block diagram of the program sequencer. 

Instruction addresses can come from four possible 
sources: a 14-bit program counter (PC), an internal 16-level 
PC stack, an interrupt controller, or a 14-bit field of the in¬ 
struction register. The program counter keeps track of the 
current instruction address and feeds an incrementer, which 
provides the next contiguous address. The PC stack stores 
subroutine and interrupt-return addresses and is chosen 
when returning to main program execution. The interrupt 
controller monitors the external interrupt-request inputs and 
provides jump vectors when servicing is needed. The in¬ 
struction register is chosen when a direct jump is executed. 

The 2100 includes status registers to keep track of 
arithmetic results, execution modes, and interrupt con¬ 
figuration. Arithmetic status drives the condition logic that 
controls the selection of a next address for conditional 
operations. Interrupt configuration status transfers to the 
interrupt controller. An internal four-level-deep stack saves 
status information automatically when vectoring to an inter¬ 
rupt service routine and restores it upon return. The status 
stack can also be pushed or popped manually at any time. 

The down counter controls program looping with a decre¬ 
ment and branch feature. Preloaded via the internal DMD 
bus, it generates a counter-expired (CE) status output when 
the count reaches zero. Decrementing occurs automatically 
every time the status is checked. A four-level stack asso¬ 
ciated with the counter allows counted loops to be nested 
five levels deep. 

Figure 3. Address generator block diagram. 

Figure 4. Program sequencer block diagram. 

The loop stack and comparator also facilitate program 
looping. The do-until instruction sets up these functions. 
When executed, this instruction pushes the end-of-loop 
address and termination condition onto the loop stack and the 
beginning-of-loop address (PC + 1) onto the PC stack. Once 
the loop is entered, the loop comparator compares the next 
address output of the sequencer with the end-of-loop 
address on the loop stack. When the two are equal, it indicates 
that the processor is fetching the last instruction in the loop. 
During the next cycle, while the last instruction is being 
executed, the condition logic tests the termination condition 
specified by the loop stack. If the termination condition is 
false, the sequencer jumps back to the beginning of the loop 
by choosing the PC stack as the next address. Otherwise, 
the sequencer exits the loop by choosing PC + 1 as the next 
address. 

Move instructions 
    Register ↔ register 
    Register ↔ data memory 
    Register ↔ program memory 
    Immediate value → register 
    Immediate value → data memory 

Computational instructions 
    Conditional ALU/MAC/SHIFT operation 
    ALU/MAC/SHIFT operation with register ↔ register 
    ALU/MAC/SHIFT operation with register ↔ data memory 
    ALU/MAC/SHIFT operation with register ↔ program memory 
    ALU/MAC operation with data memory → register and 
        program memory → register 

Program flow control 
    Conditional jump 
    Conditional subroutine call 
    Conditional return 
    Conditional trap 
    Conditional do-until 

Miscellaneous 
    Saturate accumulator 
    Modify index register 
    Push status stack 
    Pop status/loop/counter/PC stack 
    Mode control 
    No-op 

Figure 5. ADSP-2100 instruction set summary. 


This automatic looping mechanism eliminates the need
for an explicit jump instruction within the loop. Every loop
instruction is free to execute useful operations, so the
looping overhead is reduced to zero. Do-until loops can be any
length, and the loop stack allows them to be nested four
levels deep.


Instruction set 

Figure 5 summarizes the ADSP-2100 instruction set. 

Four basic categories of instructions exist. The move
instructions encompass register-register transfers,
register-memory transfers, and immediate loading of registers and
data memory. Data memory addresses are supplied either by
the data address generators or directly from a field in the
instruction word. Program memory addresses can only come
from data address generator 2. Computational instructions
exercise the ALU, MAC, and shifter functions. These
functions can be executed conditionally based on current status
register contents, or they can be combined with
register-register and register-memory move operations, including a
simultaneous read of both program and data memories. The
program-flow instructions direct the activities of the
program sequencer. Execution of these instructions can either
be unconditional or conditioned on the current status
register contents. The miscellaneous instructions include
saturation of the multiplier-accumulator output register,
manual modification of address generator index registers,
and manual pushing and popping of the various internal
stacks. All of the instructions are coded into a single 24-bit
word and execute in one cycle.

The 2100’s instruction set, with its diversity and
parallelism, does not lend itself to the usual mnemonic notation
used for most computer assembly languages. Instead, the
assembly syntax uses an algebraic notation that clearly spells
out the action taken by each instruction. No user needs to
memorize cryptic abbreviations to either write programs or
read them. Straightforward notational conventions plus the
general lack of arbitrary restrictions in the instruction set
combine to produce assembly code that rivals high-level
languages for ease of use and readability.


FIR filter 

The features and capabilities of the 2100 are best
demonstrated through programming examples of routines
commonly used in DSP applications. A very basic but important
DSP algorithm is the finite-impulse-response filter. An FIR
filter directly implements a discrete convolution between a
series of input samples and coefficients. Because of their
simplicity and well-behaved numerical properties, FIR filters
are used in a wide range of problems, particularly in the
area of telecommunications. 3
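
As a reference point for the assembly routine discussed next, the discrete convolution can be written directly in Python. The function name `fir_output` is invented for this illustration, and the sketch assumes that samples before the start of the input are zero.

```python
def fir_output(h, x, n):
    """Return y(n) = sum over k of h[k] * x[n-k], with x taken as zero before x[0]."""
    return sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)


# Feeding a unit impulse through the filter reads back the coefficients.
h = [1, 2, 3]
x = [1, 0, 0, 0]
y = [fir_output(h, x, n) for n in range(len(x))]
```

Each output sample costs one multiply-accumulate per coefficient, which is the operation the 2100 routine performs once per cycle.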

Figure 6 shows a 2100 assembly-language subroutine for
an FIR filter. Parameters pass to the routine through
registers that are set up by the calling program. Index register I0
points to the filter delay line in data memory where the
input samples are held. Index register I4 points to the filter
coefficients that are stored in program memory. Modify
registers M0 and M5, loaded with the value 1, move the
delay line and coefficient pointers ahead one place each
time they access memory. Finally, length register L0 is
loaded with the filter order, which indicates the length of
the delay line buffer. Each call to the subroutine takes one
input sample from register AX0 and generates one output
sample that is passed back to the calling program in
register MR.

The actual filter code consists of only eight instructions.
The first instruction (labeled FIRFILT) moves the contents
of register L0 into register AY0. The assembler recognizes
the equal sign as a transfer operator with the source on the
right and the destination on the left. The second instruction
combines an ALU operation with a data memory write. For
the memory write, register AX0 provides the data, register
I0 provides the memory address, and register M0
postmodifies I0. In other words, the input sample is written into the
filter delay line and the pointer is incremented. The ALU
portion of this instruction decrements register AY0 and
places the result into register AR. AR now contains the
filter order minus one, which is moved into the loop counter
by the third instruction.

The fourth instruction does three things: It clears the
MAC result register (MR), it fetches the first sample from
data memory and places it into register MX0, and it fetches
the first coefficient from program memory and places it into
register MY0. Register I0 provides the address to the data
memory and register I4 does the same for the program
memory. M0 and M5 postmodify the two addresses. The
fifth instruction sets up the looping hardware for a do-until
loop. Taploop is a label given to the last instruction in the
loop (in this case the only one), and CE refers to the
counter-expired condition that terminates execution of the
loop. The do-until instruction pushes the address of
Taploop and the CE condition code onto the loop stack,
and it pushes the contents of the program counter plus one
onto the PC stack.

.MODULE FIR;
{
********************************************************************
*   FIR Filter Subroutine
*
*                       L-1
*   Computation: Y(n) = SUM [H(k)*X(n-k)]
*                       k=0
*
*      Y: output samples
*      X: input samples
*      H: coefficients
*
*   Input:  I0 points to filter delay line (in data memory)
*           I4 points to filter coefficients (in program memory)
*           M0 contains 1
*           M5 contains 1
*           L0 contains filter order
*           AX0 contains input sample
*
*   Output: MR contains output sample
*
*   Execution Cycles: L+9 (L = filter order)
********************************************************************
}
FIRFILT:  AY0=L0;
          DM(I0,M0)=AX0, AR=AY0-1;              { Store input sample }
          CNTR=AR;
          MR=0, MX0=DM(I0,M0), MY0=PM(I4,M5);   { Clear Y, Get X, Get H }
          DO TAPLOOP UNTIL CE;
TAPLOOP:    MR=MR+MX0*MY0, MX0=DM(I0,M0), MY0=PM(I4,M5);
                                                { Y=Y+(X*H), Get next X, Get next H }
          MR=MR+MX0*MY0 (RND);                  { Y=Y+(X*H) }
          RTS;
.ENDMOD;

Figure 6. FIR filter subroutine.

At this point the processor has internally stored all of the
necessary information for sequencing through the loop. The
beginning address of the loop is on the PC stack, and the
ending address and termination condition are on the loop
stack. The loop sequencing now becomes automatic and the
loop instructions are not burdened by it. The FIR filter
routine loop contains a single instruction that executes a
multiply-accumulate operation and two memory fetches.

The instruction syntax shows the MR register being loaded
with the sum of itself and the product of registers MX0 and
MY0. It also shows MX0 being loaded with the next sample
from data memory while MY0 is loaded with the next
coefficient from program memory. MX0 and MY0 supply the
multiply operands before they are reloaded with new values.
The instruction syntax clearly shows these actions when read
from left to right. The processor will loop on this
instruction, decrementing and testing the loop counter each time.

As the loop execution proceeds, the memory pointers
increment through their respective buffers. The coefficient
pointer I4 starts out at the beginning of the coefficient
buffer and just reaches the end as the loop terminates.
However, this is not true for the sample delay line pointer I0. The
filter delay line is a circular buffer whose starting point
changes each time the filter routine is called. Since the
samples do not physically move through the memory, the
processor must realign its addressing each time a new output
is computed. All of the delay line samples must be accessed
on each pass, but the starting point is generally somewhere
in the middle of the buffer. This means that a wraparound
of the address must occur at some point. Putting the filter
order (or equivalently the delay line length) into length
register L0 takes care of this requirement automatically,
using the modulo addressing capability.
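
The interplay between the circular delay line and the linear coefficient buffer can be modeled in Python. The class below is an illustrative sketch of the access pattern in Figure 6, not a bit-accurate model of the 2100. The names `CircularFir`, `i0`, and `step` are invented here. One detail is an assumption on my part: because the first sample read after the write is the oldest one in the buffer, the sketch stores the coefficients in reverse order (last coefficient first), which the article implies but does not state.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer addressed modulo its length."""

    def __init__(self, h):
        self.length = len(h)                 # plays the role of length register L0
        self.coeffs = list(reversed(h))      # assumed storage order: h[L-1] first
        self.delay = [0.0] * self.length     # delay line in "data memory"
        self.i0 = 0                          # plays the role of index register I0

    def _advance(self):
        # Post-modify by 1 with automatic wraparound (modulo addressing).
        self.i0 = (self.i0 + 1) % self.length

    def step(self, sample):
        self.delay[self.i0] = sample         # overwrite the oldest sample
        self._advance()                      # I0 now points at the oldest survivor
        acc = 0.0
        for c in self.coeffs:                # coefficient pointer runs start to end
            acc += c * self.delay[self.i0]
            self._advance()
        return acc                           # I0 is back at the next write position


fir = CircularFir([1, 2, 3])
outputs = [fir.step(s) for s in [1, 0, 0, 0, 5]]
```

After each call the delay line pointer has made exactly one full trip around the buffer plus one step, so the next input sample lands on the oldest entry without any data being moved.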

It is important to note that the loop in this subroutine
fits easily into the 2100’s internal cache memory. So during
loop execution, the cache can assume the program
memory function, and the system performs like a three-memory
architecture. All but one of the coefficients are fetched
without incurring an extra cycle penalty. As with the
modulo addressing, this action is transparent to the
programmer. The processor fetches from cache whenever
possible, without explicit codes or directions.

.MODULE BIQUAD;
{
********************************************************************
*   Biquad Cascade IIR Filter Subroutine
*
*   Algorithm: For each biquad section
*
*               2                    2
*      Y(N) =  SUM [B(K)*X(N-K)] +  SUM [A(K)*Y(N-K)]
*              K=0                  K=1
*
*   Input:  SR1 contains input sample
*           I0 points to delay line buffer (in data memory)
*           I4 points to coefficient buffer (in program memory)
*           M0 contains 1
*           M1 contains -3
*           M4 contains 1
*           CNTR contains number of biquad sections
*
*   Output: SR1 contains output sample
*
*   Cycles: (7*L)+10 where L is the number of biquad sections
********************************************************************
}
BIQUAD:    SE=SCALE;
           DO SECTIONS UNTIL CE;
             MX0=DM(I0,M0), MY0=PM(I4,M4);                 { GET X(N-2), GET B(2) }
             MR=MX0*MY0, MX1=DM(I0,M0), MY0=PM(I4,M4);     { X(N-2)*B(2), GET X(N-1), GET B(1) }
             MR=MR+MX1*MY0, MY0=PM(I4,M4);                 { X(N-1)*B(1), GET B(0) }
             MR=MR+SR1*MY0, MX0=DM(I0,M0), MY0=PM(I4,M4);  { X(N)*B(0), GET Y(N-2), GET A(2) }
             MR=MR+MX0*MY0, MX0=DM(I0,M1), MY0=PM(I4,M4);  { Y(N-2)*A(2), GET Y(N-1), GET A(1) }
             DM(I0,M0)=MX1, MR=MR+MX0*MY0 (RND);           { STORE X(N-1) AS X(N-2), Y(N-1)*A(1) }
SECTIONS:    DM(I0,M0)=SR1, SR=ASHIFT MR1 (HI);            { STORE X(N) AS X(N-1), ADJUST Y(N) }
           DM(I0,M0)=MX0;                                  { STORE Y(N-1) AS Y(N-2) }
           DM(I0,M0)=SR1;                                  { STORE Y(N) AS Y(N-1) }
           RTS;
.ENDMOD;

Figure 7. Biquad filter subroutine.


Loop execution terminates when the counter reaches
zero. Automatic popping of the appropriate stacks
restores the internal hardware to its original state, and
program control transfers to the location immediately
following the loop. These actions happen one cycle early,
since the counter was loaded with the filter order minus
one. The early transfer occurs because no additional
memory fetches are needed after the last
multiply-accumulate. This operation occurs outside the loop with
the rounding option enabled so that the best 16-bit result
can be obtained. The last instruction is a
return-from-subroutine that transfers control back to the calling program.


Biquad IIR filter

Infinite impulse response (IIR) filters provide a somewhat
different approach to the filtering problem. The basic
computational operation is still multiply-accumulate, but
feedback terms appear with the feed-forward paths, giving poles
as well as zeroes. We refer to a second-order filter with two
poles and two zeroes arranged so that all the products are
accumulated into a single node as a biquad filter. Higher order
filters can be constructed by cascading biquad sections.


Figure 7 displays the 2100 assembly code for a series of
biquad filter sections. Input parameters to the subroutine
pass through internal registers. The parameters include
pointers to the filter delay line and coefficient buffer, several
address modification values, and the number of cascaded
filter sections. As with the FIR filter, each call to this
subroutine accepts one input sample and generates one output
sample.

Each biquad section theoretically requires two delay lines,
one for the inputs (to feed forward) and one for the outputs
(to feed back). However, since they are cascaded, the output
delay line of the nth section is identical to the input delay
line of the (n+1)st section. There is no need to store
redundant information, so the two delay lines are combined into
a single one that is shared between the two adjacent
sections. All of the combined delay lines are then concatenated
to form a single buffer in memory.

The bulk of the subroutine consists of a do-until loop
labeled Sections. The loop contains seven instructions and
executes each biquad section in seven cycles. Most of the
instructions perform an arithmetic operation with one or two
memory accesses. As before, program memory stores the
filter coefficients and data memory stores the delay line
buffer, allowing simultaneous access. The modulo addressing
feature is not used here at all. It is more desirable to have
one index register access all of the concatenated delay lines
than to dedicate a pointer to each one. Also, since the
individual delay lines are only two samples long, not much
effort is required to shift them manually. This step occurs as
part of the last two instructions in the loop. The last
instruction also performs an arithmetic scaling operation on the
filter output. It corrects for any prescaling that may have
been necessary to represent the coefficients in fixed-point
format.
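
The shared delay line arrangement can be illustrated with a short Python sketch. It follows the difference equation in the Figure 7 header (feedback terms added, not subtracted) but makes no attempt to mirror the register usage or the fixed-point scaling step. The function name `biquad_cascade` and the layout of `history` as a list of two-sample pairs are choices made for this illustration only.

```python
def biquad_cascade(sample, sections, history):
    """Run one input sample through cascaded biquad sections.

    sections: list of (b0, b1, b2, a1, a2) tuples, one per section.
    history:  list of len(sections) + 1 pairs [v(n-1), v(n-2)]. Pair j is the
              input delay line of section j and the output delay line of
              section j-1, so nothing is stored twice.
    """
    x = sample
    for j, (b0, b1, b2, a1, a2) in enumerate(sections):
        x1, x2 = history[j]          # past inputs of this section
        y1, y2 = history[j + 1]      # past outputs of this section
        y = b0 * x + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2
        history[j] = [x, x1]         # shift this section's input delay line
        x = y                        # output feeds the next section
    last = len(sections)
    history[last] = [x, history[last][0]]   # shift the final output delay line
    return x


# One section with a single feedback term: y(n) = x(n) + 0.5 * y(n-1).
hist = [[0.0, 0.0], [0.0, 0.0]]
impulse_response = [biquad_cascade(s, [(1, 0, 0, 0.5, 0)], hist) for s in [1, 0, 0]]
```

The update after the loop plays the same role as the two store instructions that follow the Sections loop in Figure 7: the last section's output history has no following section to shift it.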


Fast Fourier transform 

A somewhat more sophisticated example of the 2100’s
abilities is provided by that war-horse of DSP
specsmanship, the fast Fourier transform. The importance of the
FFT for this purpose is justified, for it taxes the
computational and addressing capabilities of any processor. Its
application to such areas as electronic instrumentation, radar
systems, and speech processing gives it practical importance
as well. 4

The radix-2 FFT computation is divided into several
stages. If the transform size is N (a power of two), there are
log2(N) stages. All of the data samples are processed at each
stage to produce new samples, which in turn are processed
by the next stage. The butterfly, the basic kernel
computation, operates on two complex samples at a time and
generates two new samples. Each stage contains N/2
butterflies arranged in contiguous groups. For the
decimation-in-frequency FFT the first stage has a single group of N/2
butterflies. In each successive stage the number of butterfly
groups doubles and the size of each group is cut in half. In
the last stage, there are N/2 groups with one butterfly each.
The computation for each stage is usually performed
“in-place” so that only one data storage area is required.
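
The stage, group, and butterfly structure can be seen in a compact Python sketch of an in-place radix-2 decimation-in-frequency FFT. This is a generic textbook formulation using complex arithmetic and the conventional forward-transform twiddle factor exp(-j2πk/N); the sign convention of the sine table in the 2100 routine is not stated in the article, so no equivalence is claimed on that point. The names `fft_dif` and `bit_reverse_order` are invented for this example.

```python
import cmath


def fft_dif(data):
    """In-place radix-2 DIF FFT; results are left in bit-reversed order."""
    n = len(data)
    groups, size = 1, n // 2             # first stage: one group of N/2 butterflies
    while size >= 1:                     # one pass per stage, log2(N) passes in all
        for g in range(groups):
            base = g * 2 * size          # groups are contiguous in the buffer
            for b in range(size):
                w = cmath.exp(-2j * cmath.pi * b * groups / n)
                a, c = data[base + b], data[base + b + size]
                data[base + b] = a + c               # C = A + B
                data[base + b + size] = (a - c) * w  # D = (A - B) * W
        groups, size = groups * 2, size // 2         # double groups, halve group size
    return data


def bit_reverse_order(values):
    """Return the values reordered by bit-reversed index."""
    bits = len(values).bit_length() - 1
    return [values[int(format(i, f"0{bits}b")[::-1], 2)] for i in range(len(values))]


samples = [complex(v) for v in [1, 2, 3, 4, 0, 0, 0, 0]]
spectrum = bit_reverse_order(fft_dif(list(samples)))
```

The two updates at the end of the `while` body correspond to the doubling of the group count and halving of the group size described above.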

Figure 8 depicts a 2100 routine for a
decimation-in-frequency FFT. The code has a structure that closely follows
the format described above. Written as a generic
subroutine, it transforms any power-of-two number of complex
time samples into an equal number of complex frequency
samples. Input parameters pass through registers and
memory locations to define the size of the transform and
the location of the data and coefficient buffers. Index
registers I4 and I5 point to the cosine and sine tables in
program memory. Modify registers M0 and M1 are set up to
+1 and -1 so that the memory pointers can be moved both
forward and backward. Length registers L4 and L5 contain
the length information that controls the modulo addressing
of the sine and cosine tables. The loop counter is loaded
with the total number of stages in the transform. Three data
memory locations store the initial values for the number of
butterfly groups per stage, the memory spacing between
adjacent groups, and the number of butterflies per group. One
final data memory location provides the base address of the
data buffer.


Although the code is a good deal more complicated than
the filtering examples, the same basic mechanisms are at
work. The three nested do-until loops correspond to the
basic FFT entities: butterflies, groups, and stages. The
innermost loop (butterflies) consists of nine instructions that
execute the DIF butterfly computation. Packed into these
nine instructions are eight arithmetic operations, 10 memory
references, and a register move. Since this loop fits into
cache memory, the extra cycles for the two program
memory references disappear after the first pass. The butterfly
loop is contained within another loop (groups) that executes
a contiguous group of butterflies. The four instructions
preceding the butterfly loop set up the counter and fill the data
pipeline. Four instructions immediately follow the butterfly
loop; they reposition the data memory pointers for the next
group of contiguous butterflies. The outermost loop (stages)
executes a complete pass through the data buffer, producing
a new set of samples to be processed during the next pass.
The instructions at the beginning of this loop initialize the
various memory pointers and modify values for the first
group of butterflies in the current stage. Instructions to
update the parameters that define the grouping of butterflies
for the next stage appear at the end of the loop.

The execution time for any size FFT can be computed
easily. For instance, if N = 1024, the total number of
butterflies is 5120, the total number of butterfly groups is 1023,
and the number of stages is 10. Plugging these values into
the execution cycle formula given in the subroutine header
yields a total of 57,545 processor cycles. For a processor
cycle time of 125 nsec, the 1024-point FFT executes in 7.2
msec.
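
The arithmetic can be checked with a few lines of Python that evaluate the cycle formula from the Figure 8 header. The function name `fft_cycles` is invented for this illustration.

```python
def fft_cycles(n):
    """Evaluate (9*B)+(11*G)+(21*S)+2 for an N-point transform (N a power of two)."""
    stages = n.bit_length() - 1          # S = log2(N)
    butterflies = n * stages // 2        # B = N*log2(N)/2
    groups = n - 1                       # G = N-1
    return 9 * butterflies + 11 * groups + 21 * stages + 2


cycles = fft_cycles(1024)
time_ms = cycles * 125e-9 * 1e3          # 125-nsec cycle time, result in msec
```

For N = 1024 the butterfly term alone accounts for 46,080 of the cycles, so the nine-instruction inner loop dominates the total.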

The dynamic range of the FFT algorithm improves by
maintaining a block-floating-point representation of the
data. The 2100 FFT subroutine can easily be modified to
accommodate a block representation by using the block
exponent and normalization functions of the shifter. The block
exponent derivation adds three cycles to the butterfly loop.
This operation yields the exponent of the largest number
produced at each stage. If this exponent is such that an
overflow might occur during the next stage, every number
in the data buffer is downshifted by an appropriate amount.
This modification requires a two-cycle do-until loop at the
end of the stage loop; the loop is executed only when
downshifting is necessary. Assuming that downshifting
occurs between half of the stages, a 1024-point,
block-floating-point FFT executes in only 11.7 msec. Table 1
summarizes the performance of the ADSP-2100 for these and
other common algorithms.
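
The 11.7-msec figure can be roughly reproduced under one plausible reading of the description above. The sketch below assumes that the two-cycle downshift loop runs once per data word, that the buffer holds 2N words (real and imaginary parts), and that no other overhead is added. These are assumptions made for this illustration; the article does not spell out the accounting. The function name `block_fp_fft_cycles` is likewise invented.

```python
def block_fp_fft_cycles(n, downshift_fraction=0.5):
    """Estimate cycles for the block-floating-point variant (assumed accounting)."""
    stages = n.bit_length() - 1
    butterflies = n * stages // 2
    # Butterfly loop grows from 9 to 12 cycles for block exponent derivation.
    base = (9 + 3) * butterflies + 11 * (n - 1) + 21 * stages + 2
    # Assumed: 2 cycles per word over 2N words, for the stages that downshift.
    downshift = int(stages * downshift_fraction) * 2 * (2 * n)
    return base + downshift


bfp_cycles = block_fp_fft_cycles(1024)
bfp_time_ms = bfp_cycles * 125e-9 * 1e3
```

With these assumptions the estimate comes to 93,385 cycles, or about 11.67 msec at 125 nsec per cycle, which is consistent with the quoted figure.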


.MODULE FFT;
{
********************************************************************
*   Decimation-in-Frequency FFT
*
*   Algorithm: DIF butterfly operation is
*
*      Cr = Ar + Br
*      Ci = Ai + Bi
*      Dr = (COS * (Ar - Br)) - (SIN * (Ai - Bi))
*      Di = (COS * (Ai - Bi)) + (SIN * (Ar - Br))
*
*   Input:  I4 points to COS table (in program memory)
*           I5 points to SIN table (in program memory)
*           M0 contains 1
*           M1 contains -1
*           L4 contains COS table length = N/2
*           L5 contains SIN table length = N/2
*           CNTR contains number of stages = Log(N)
*           DM(GRPCOUNT) contains group count for first stage = 1
*           DM(GRPSPACING) contains group spacing for first stage = N
*           DM(BFYPERGRP) contains butterflies/group for first stage = N/2
*           DM(START) points to beginning of data buffer (in data memory)
*
*   Output: Frequency samples in data buffer in bit reversed order
*
*   Execution Cycles: (9*B)+(11*G)+(21*S)+2
*      where B is the total number of butterflies = [N*Log(N)]/2
*            G is the total number of groups = N-1
*            S is the total number of stages = Log(N)
********************************************************************
}
FFTDIF:       DO STAGES UNTIL CE;
                AX0=DM(START);
                I0=AX0;                   { INITIALIZE A,C POINTER }
                AY0=DM(GRPSPACING);
                M2=AY0;
                I1=AX0;
                MODIFY(I1,M2);            { INITIALIZE B POINTER }
                I2=I1;                    { INITIALIZE D POINTER }
                AX0=2;
                AR=AY0-AX0;
                M3=AR;
                CNTR=DM(GRPCOUNT);        { GET GROUP COUNT }
                M5=CNTR;                  { GET SIN/COS INCREMENT }
                DO GROUPS UNTIL CE;
                  CNTR=DM(BFYPERGRP);     { GET BUTTERFLY/GROUP COUNT }
                  AX0=DM(I0,M0);          { GET Ar }
                  AY0=DM(I1,M0);          { GET Br }
                  AY1=DM(I1,M0);          { GET Bi }
                  DO BUTTERFLIES UNTIL CE;
                    AR=AX0+AY0, AX1=DM(I0,M1), MY0=PM(I4,M5);  { Cr=Ar+Br, GET Ai, GET COS }
                    DM(I0,M0)=AR, AR=AX1+AY1;                  { STORE Cr, Ci=Ai+Bi }
                    DM(I0,M0)=AR, AR=AX0-AY0;                  { STORE Ci, COMPUTE Ar-Br }
                    MX0=AR, AR=AX1-AY1;                        { MOVE Ar-Br to MACC, COMPUTE Ai-Bi }
                    MR=MX0*MY0, AX0=DM(I0,M0), MY1=PM(I5,M5);  { COMPUTE COS*(Ar-Br), GET Ar, GET SIN }
                    MR=MR-AR*MY1 (RND), AY0=DM(I1,M0);         { Dr=COS*(Ar-Br)-SIN*(Ai-Bi), GET Br }
                    DM(I2,M0)=MR1, MR=AR*MY0;                  { STORE Dr, COMPUTE COS*(Ai-Bi) }
                    MR=MR+MX0*MY1 (RND), AY1=DM(I1,M0);        { Di=COS*(Ai-Bi)+SIN*(Ar-Br), GET Bi }
BUTTERFLIES:        DM(I2,M0)=MR1;                             { STORE Di }
                  MODIFY(I2,M2);          { ADVANCE D POINTER BY GRPDIST }
                  MODIFY(I1,M3);          { ADVANCE B POINTER BY GRPDIST-2 }
                  MODIFY(I0,M3);          { ADVANCE A,C POINTER BY GRPDIST-1 }
GROUPS:           MODIFY(I0,M0);
                SI=DM(GRPCOUNT);
                SR=LSHIFT SI BY 1 (LO);   { DOUBLE GROUP COUNT }
                DM(GRPCOUNT)=SR0;
                SI=DM(GRPSPACING);
                SR=LSHIFT SI BY -1 (LO);  { CUT GROUP SPACING IN HALF }
                DM(GRPSPACING)=SR0;
                SR=LSHIFT SR0 BY -1 (LO); { CUT BUTTERFLIES/GROUP IN HALF }
STAGES:         DM(BFYPERGRP)=SR0;
              RTS;
.ENDMOD;

Figure 8. FFT subroutine.


The new generation of programmable DSP processors
must be able to cope easily with such things
as adaptive filtering, linear prediction, pattern
recognition, dynamic programming, matrix decomposition,
and vector quantization. Also an increasing need exists to
deal with floating-point, complex, or multidimensional
data. The ADSP-2100 expands the realm of programmable
signal processors into these areas, making possible the
implementation of DSP systems that previously would have
required special-purpose solutions. 5
Table 1. ADSP-2100 performance figures.

Routine                                   Execution time
----------------------------------------  -------------------------
64-tap FIR filter                         8.0 µs per output sample
64-tap, complex FIR filter                32.0 µs per output sample
Biquad filter section                     0.88 µs per section
Normalized lattice filter section         0.63 µs per section
1024-point, complex FFT                   7.2 ms total
64-tap FIR filter gradient adaptation     16.0 µs total
Two-dimensional convolution (3x3 mask)    2.5 µs per output sample
Matrix multiply (10 x 10 matrices)        0.22 ms total
Floating-point multiply-accumulate        1.625 µs total
Trigonometric sine                        3.25 µs total
Feature


NEC’s µPD77230 Digital Signal Processor

This 150-ns device performs full 32-bit floating-point arithmetic, incorporates
on-chip instruction and data memory, and provides both serial and parallel I/O.


Over the past several years, microprocessor
technology has produced a number of interesting
offshoots, including single-chip digital signal
processors. The primary feature that distinguishes a digital
signal processing device from an “ordinary”
microprocessor is an on-chip hardware multiplier that operates in a
single instruction cycle. Typically, a large number of
multiplications are employed in DSP algorithms, and the
kernels of such algorithms consist of a few operations
repeated many times. These algorithms often implement a
sum of products, in which many multiplications must be
done on each input sample. These algorithms are used in
applications such as finite impulse response (FIR) filters,
infinite impulse response (IIR) filters, autocorrelators, and
fast Fourier transforms (FFTs). Multiplication speed is
therefore one of the most important considerations in
choosing a DSP device for a given application.

Here, we present a brief history of DSP devices, the
rationale for developing a floating-point signal processor,
details of the hardware architecture of a new digital signal
processor from NEC Electronics, an overview of its
instruction set, and certain benchmarks and code examples.


DSP history 

The Intel 2920, introduced in 1980, marked the first 
generation of single-chip DSP devices. Although it did not 
include a hardware multiplier, it may be considered a DSP 
device since it contained on-chip A/D and D/A converters 
and was designed for small, repetitive tasks. 

The second generation of single-chip DSP devices is
typified by the NEC µPD7720 and the Texas Instruments
TMS320. Both provide a one-instruction-cycle multiplier,
but they have somewhat different architectural
philosophies. The TMS32010, which was introduced in 1982, has
an instruction set similar to that of a general-purpose
microprocessor, with instructions like LOAD, MOVE, and
ADD. The NEC µPD7720, which was introduced in 1980,
has a microcode-like instruction set that it combines with a
parallel architecture to enable a single instruction to load
the two multiplier inputs, accumulate the multiplied
product, modify both RAM/ROM pointers, and execute a
return from subroutine.

When faced with an application whose requirements
exceed the capability of a single-chip DSP device such as the
TMS32010 or µPD7720, many system designers turn to a
bit-slice solution. Bit-slice architectures make use of
independent data paths and parallel structures in which
microcoding is employed. Although bit-slice processors
provide very high performance, they are difficult to program
and frequently require hardware to be reconfigured. They
also must be built out of several discrete components.

0272-1732/86/1200-0060$01.00 © 1986 IEEE

Now, third-generation single-chip DSP devices, which are
characterized by fabrication in low-power CMOS, fast
instruction cycle times, high-precision arithmetic, and on-chip
resources such as RAM and ROM, are becoming available.
Their high performance, low cost, and ease of use make
them an attractive replacement for bit-slice processors.


The NEC µPD77230

NEC Electronics recently introduced the first member of
a family of third-generation, CMOS digital signal
processors. Called the µPD77230 Advanced Signal Processor,
or ASP, the new device incorporates full 32-bit
floating-point (24-bit mantissa, 8-bit exponent) arithmetic, a 150-ns
instruction cycle time (even for multiply/accumulate), a 1K
x 32-bit internal RAM, a 2K x 32-bit internal instruction
ROM, a 1K x 32-bit data ROM, and serial and parallel
I/O. The device integrates more than 370,000 transistors in
a 1.75-µm CMOS process and dissipates less than one watt.
It is packaged in a 68-pin grid array.

A variant of the µPD77230 ASP will also be introduced.
The µPD77220, a fixed-point-only version of the µPD77230,
will have half its internal RAM (2 x 256 x 24), will cost less,
and will consume approximately 0.7 watt.

The rationale for a floating-point digital signal processor

Many first-generation DSP chips represent numbers in
fixed-point form. Samples are therefore peak-limited by +1
and -1, sometimes requiring that the filter coefficient
values be prescaled. This prescaling operation frequently
introduces a round-off error into the system. For this reason,
finite-length integer calculations are often modeled with a
“white noise” error source added to the incoming signal.
Not only can truncation and rounding off cause a loss of
precision, but the accumulated products of truncated or
rounded values can alter the system’s overall transfer
function. The resulting relocation of the transfer function’s
poles and zeroes can introduce instabilities into the system.
Furthermore, fixed-point calculations are very susceptible
to overflow (underflow) because of their limited word
length. Many fixed-point systems require overflow checking
routines that saturate overflow values. Not only is a
significant level of error introduced when a value is clamped,
but additional programming overhead is required. When
efficient code for sum-of-product (finite impulse response)
filters is being developed, for example, overflow detection,
when required, can account for a substantial number of
instructions. Multiple-stage operations such as the biquad
filter also require overhead for overflow checking.

Floating-point devices avoid many of these drawbacks.
The ASP, for example, provides 24-bit precision in the
mantissa, just as a 24-bit fixed-point device does, but also
employs an 8-bit, two’s-complement exponent, which
increases the dynamic range of the signal. Since the ASP can
represent much smaller numbers, it enhances the overall
precision of the system in which it is used. In most cases,
overflow checking is unnecessary with the ASP. Eight bits
of exponent permits numbers with absolute values as large
as 1.7 x 10^38 and as small as 3.5 x 10^-46.
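
These limits can be reproduced with a short calculation, under an assumed interpretation of the format: the value is mantissa x 2^exponent, the mantissa is a 24-bit two’s-complement fraction in the range [-1, 1), and the exponent is an 8-bit two’s-complement integer in the range [-128, 127]. The article gives only the field widths, so this interpretation is an assumption made for the illustration.

```python
# Assumed format: value = mantissa * 2**exponent, with
#   mantissa: 24-bit two's-complement fraction, -1 <= m < 1 (step 2**-23)
#   exponent: 8-bit two's-complement integer, -128 <= e <= 127

largest = 1.0 * 2.0 ** 127             # |mantissa| = 1 (m = -1), exponent = +127
smallest = 2.0 ** -23 * 2.0 ** -128    # one mantissa LSB, exponent = -128

largest_text = f"{largest:.1e}"
smallest_text = f"{smallest:.1e}"
```

Under this reading the smallest value counts unnormalized mantissas, which is what pushes the lower limit well below 2^-128.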


ASP hardware architecture 

The architecture of the µPD77230 ASP is quite similar to
that of a microcoded building block system. It incorporates
a full 32 x 32-bit floating-point multiplier, a 55-bit
floating-point ALU, a 47-bit barrel shifter, a program and data
ROM, a data RAM, and control (Figure 1). By
interconnecting these functional blocks with multiple data paths on
a single chip, the ASP allows several operations to occur
simultaneously. For example, the ASP can perform a
floating-point multiply, a floating-point addition, dual base
and index pointer moves, a barrel shift, an internal data bus
register transfer, and serial I/O in a single 150-ns cycle. In
contrast, many other digital signal processors are restricted
to operating on only one functional block (the multiplier,
for example) in a cycle. The ASP therefore combines the
flexibility of a general-purpose microprocessor with the
computational power of high-performance signal processing
elements.

The ASP’s processing unit consists of three major blocks:
the ALU, the barrel shifter, and the working registers
(accumulators). The ALU is a 47-bit, fixed-point unit using
two inputs selected by two input registers, P and Q. The Q
register selects input from one of the eight working
registers, while the P register chooses input from one of four
sources: the main bus, the multiplier output, or one or the
other of the two internal RAM blocks. The ASP employs a
microcoded type of instruction set that multiplexes the P
and Q inputs to the ALU. The ALU can perform a 55-bit
floating-point add/subtract, a 47-bit fixed-point
add/subtract, and 47-bit logical operations (XOR, AND, OR, and
NOT). It can also reverse the bit order to aid in addressing
an in-place fast Fourier transform operation.

The processing unit uses the 47-bit bidirectional barrel shifter
in several ways. During a floating-point add or subtract, for
example, the exponent arithmetic unit, or EAU, signals the barrel
shifter to align the two floating-point numbers so it can add
their mantissas. And when fixed-point arithmetic is being
performed, a shift value may be specified in the same
instruction. This value is latched into the shift value register,
or SVR, so that it need not be specified on each instruction. The
SVR shifts the P input before it enters the ALU, and therefore
the P input can be used to prescale a series of numbers.

The barrel shifter can also be used to shift a working register
by n bits, either left or right.

Figure 1. Block diagram of the µPD77230.

Finally, the barrel shifter is used whenever a normalization
instruction is executed in one of the working registers. This
occurs during conversions between ASP and IEEE 32-bit
floating-point formats. As we noted before, the ASP
floating-point format uses a two’s-complement exponent and
mantissa. The ASP’s designers chose this format to reduce
hardware complexity. The IEEE format, however, has an offset
exponent and sign-magnitude mantissa. In order to convert numbers
to IEEE format, the ASP must add the offset to the exponent,
change the mantissa to an absolute value, and then rotate and
fill the mantissa to yield the proper hidden bit format.
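
The three conversion steps can be sketched in Python. This is a software illustration under stated assumptions, not the ASP's actual procedure: the mantissa is assumed to be a signed 24-bit integer read as the fraction m / 2**23, the exponent a signed 8-bit integer, and results too small for a normalized IEEE number are simply flushed to zero. The function names are hypothetical.

```python
import struct


def asp_to_ieee_bits(mantissa, exponent):
    """Convert an assumed (mantissa, exponent) pair to IEEE single-precision bits.

    Assumed value: (mantissa / 2**23) * 2**exponent, both fields two's complement.
    """
    if not -(1 << 23) <= mantissa < (1 << 23):
        raise ValueError("mantissa must fit in 24 signed bits")
    if not -128 <= exponent <= 127:
        raise ValueError("exponent must fit in 8 signed bits")
    if mantissa == 0:
        return 0
    sign = 1 if mantissa < 0 else 0
    magnitude = abs(mantissa)            # change the mantissa to an absolute value
    while magnitude < (1 << 23):         # shift until the leading one is in bit 23
        magnitude <<= 1
        exponent -= 1
    biased = exponent + 127              # add the IEEE exponent offset
    if biased <= 0:
        return sign << 31                # sketch only: flush tiny values to zero
    fraction = magnitude & 0x7FFFFF      # bit 23 becomes the hidden bit
    return (sign << 31) | (biased << 23) | fraction


def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

For instance, a mantissa of 0x400000 (0.5 under the assumed scaling) with exponent 1 represents 1.0 and yields the IEEE pattern 0x3F800000.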

There are eight 55-bit working registers on the processing unit
bus; they can be exchanged or moved via the main bus into any
other register or RAM location. For example, when a finite
impulse response, or FIR, filter is being performed, a working
register can be continuously accumulated; that is, it can be fed
back to the floating-point adder on each successive cycle. This
implies that FIR filters require only one instruction cycle per
tap.

The multiplier section has two 32-bit floating-point inputs that
can be loaded simultaneously, one from the main bus and the other
from a special sub-bus that accesses either of the two
independent RAM blocks. Internally, the multiplier consists of a
24 x 24-bit, two’s-complement, fixed-point multiplier with a
47-bit result and an 8-bit exponent adder; together they yield a
55-bit floating-point product. This product can be routed to a
number of destinations: it can be input to the floating-point
ALU, moved to one of the working registers, or truncated to 32
bits and stored in other registers or internal RAM. The
multiplier is implemented through a modified Booth’s algorithm in
a Wallace tree configuration.


ASP memory and addressing 

The µPD77230 ASP features 1K x 32 bits of internal RAM, 2K x 32
bits of internal instruction ROM, and 1K x 32 bits of internal
data/coefficient ROM. Both ROM areas can be masked. The ASP
employs several memory addressing modes, each optimized for
particular DSP operations.

The 1K x 32 bits of data RAM are organized as two separately
addressable 512 x 32-bit blocks. Each block has its own base and
index pointer that can be independently modified in an
instruction. Furthermore, the base registers can be placed in a
modulo count mode. In this mode, the upper n bits (n = 1, 2, 3,
..., 8) are set to a fixed pattern and the lower (9 - n) bits can
be used to cycle through a table and wrap around to zero without
generating a carry to the higher n bits. The RAM blocks are
available to the main bus, the P input register, or either of the
multiplier inputs.
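
A short Python model may make the modulo count mode concrete. It is a sketch of the addressing arithmetic described above, not of the register hardware; the function name and the `step` parameter are assumptions for the example.

```python
POINTER_BITS = 9  # each RAM block holds 512 words


def modulo_increment(pointer, n, step=1):
    """Advance a 9-bit base pointer while holding its upper n bits fixed."""
    if not 1 <= n <= 8:
        raise ValueError("n must be between 1 and 8")
    low_mask = (1 << (POINTER_BITS - n)) - 1          # the (9 - n) cycling bits
    fixed = pointer & ~low_mask & ((1 << POINTER_BITS) - 1)
    # The low field wraps to zero; no carry reaches the fixed upper bits.
    return fixed | ((pointer + step) & low_mask)
```

With n = 6 the lower three bits address an 8-entry table, so a pointer at the last entry (binary 101000111) returns to the first entry (binary 101000000) on the next increment.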

The 1K x 32 bits of data/coefficient ROM are accessed either by
the 10-bit ROM pointer (which is incremented or decremented) or
by the 9-bit field immediately specified by an instruction. The
ROM includes a special modification feature that allows a 2^n add
operation. ROM output data are available for use anywhere in the
device via the main bus.

The 2K x 32 bits of internal instruction ROM are controlled by a
13-bit program counter that is saved on an eight-level internal
stack during subroutine and interrupt calls. The contents of the
top of the stack are available to the main bus, and consequently
the programmer can directly modify the program counter.

ASP operating modes 

The ASP is configured at power-up to perform in either a master
or slave mode. In master mode, it connects 32 data pins to the
system data bus. It directly addresses up to 4K words of external
instructions and up to 8K words of external data. The external
memory overlaps in such a way that if 4K of external instruction
is used, only 4K of external data is available. Furthermore, the
lower 4K of external memory is high-speed and can be accessed in
a single cycle.

This implies that the external instruction memory in the
high-speed area must have an access time of 45 ns. There is no
loss in speed if the ASP uses an external instruction instead of
one from internal memory. The low-speed area stores data only and
accesses standard 250-ns memory in three cycles.

In slave mode, the ASP interfaces to the system bus 
through a 16-bit I/O port that transfers data in 16- or 32-bit 
words. A local data bus, accessible only to a slave ASP, has 
a data length that can be programmed, in byte increments, 
to be from 8 to 32 bits. In addition, a slave ASP provides 
four general-purpose I/O pins: two programmable outputs 
and two testable inputs. 

The ASP includes a status register for controlling the various
operating modes. It also provides both maskable and nonmaskable
interrupt control and incorporates a loop counter register that
is automatically decremented. It has an external system clock
that can be used to synchronize multiprocessor configurations.

The ASP’s independent serial I/O sections are ideal for
interfacing it directly to a codec or a successive-approximation
analog-to-digital converter. The ASP’s serial interface section
contains serial shift registers, parallel-to-serial converters,
and input/output control circuitry. The serial output of one ASP
can be directly cascaded to the serial input of another ASP. The
serial input and output sections are independently programmable:
the transfer-bit length for each can be specified, in byte
increments, to be from 8 to 32 bits. They can also be configured
to use an internal or external serial clock, either allowing a
transfer rate of up to 5 MHz.

The serial input and output sections also have separate 
enable signals and they automatically reset during loss of 
synchronization. 

The ASP instruction set 

The instruction set of the µPD77230 ASP is best described as
horizontal microcode. There are three basic types of
instructions: OP, load immediate, and branch (see Figure 2). The
highly parallel architecture of the ASP enables a single
instruction to perform several simultaneous operations. All
instructions execute in a single 150-ns cycle and occupy one word
(32 bits) of instruction memory. As mentioned earlier, there is
no penalty in speed associated with the use of the external
instruction space.

The most frequently used instruction type is OP. OP instructions
have six separate fields of microcoded bits that perform the
following functions: select ALU operation, select ALU operands,
transfer internal data (including the double load of multiplier
inputs), modify RAM pointers, modify ROM pointers, and control
modes.

OP type instruction

  Bits    31-27    26-15      14-13   12-10   9-5       4-0
  Field   OP [5]   CNT [12]   P [2]   Q [3]   SRC [5]   DST [5]

Branch type instruction

  Bits    31-28   27-15     14-10   9-5       4-0
  Field   B [4]   NA [13]   C [5]   SRC [5]   DST [5]

Load type instruction

  Bits    31-29     28-5      4-0
  Field   LDI [3]   IM [24]   DST [5]

Figure 2. µPD77230 instruction types.


Table 1. Specifications for the fields in OP instructions.

  Mnemonic  Operation                   Mnemonic  Operation
  NOP       No operation                CLR       Clear
  INC       Increment                   NORM      Normalize
  DEC       Decrement                   CVT       Convert floating-point format
  ABS       Absolute value              ADD       Fixed-point add
  NOT       NOT (ones complement)       SUB       Fixed-point subtract
  NEG       Negate (twos complement)    ADDC      Fixed-point add with carry
  SHLC      Shift left with carry       SUBC      Fixed-point subtract with borrow
  SHRC      Shift right with carry      CMP       Compare (floating point)
  ROL       Rotate left                 AND       Logical AND
  ROR       Rotate right                OR        Logical OR
  SHLM      Shift left multiple         XOR       Logical exclusive OR
  SHRM      Shift right multiple        ADDF      Floating-point add
  SHRAM     Shift right arithmetic      SUBF      Floating-point subtract
            multiple


There are 26 ALU operations that process two 32-bit
floating-point numbers to produce a 55-bit result (Table 1). The
P and Q inputs select the operands for the ALU operation. The Q
input (three bits) selects one of the eight working registers
(accumulators), while the P input (two bits)
selects either the main bus, multiplier output, or one of the
two RAM blocks. 
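
The OP-type field layout of Figure 2 can be expressed as a few shifts and masks. The Python sketch below assumes only the bit positions given in the figure; the function names are hypothetical and the field values in the usage note are arbitrary, since the article does not list the actual opcode encodings.

```python
def decode_op(word):
    """Split a 32-bit OP-type instruction word into the Figure 2 fields."""
    if not 0 <= word < (1 << 32):
        raise ValueError("instruction word must be 32 bits")
    return {
        "OP": (word >> 27) & 0x1F,    # bits 31-27, ALU operation
        "CNT": (word >> 15) & 0xFFF,  # bits 26-15, control
        "P": (word >> 13) & 0x3,      # bits 14-13, P operand select
        "Q": (word >> 10) & 0x7,      # bits 12-10, Q operand select
        "SRC": (word >> 5) & 0x1F,    # bits 9-5, transfer source
        "DST": word & 0x1F,           # bits 4-0, transfer destination
    }


def encode_op(op, cnt, p, q, src, dst):
    """Pack the six fields back into one 32-bit word."""
    return (op << 27) | (cnt << 15) | (p << 13) | (q << 10) | (src << 5) | dst
```

Encoding arbitrary field values and decoding the result returns the same values, and the six field widths (5 + 12 + 2 + 3 + 5 + 5) account for all 32 bits of the word.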

The internal data transfer field moves one of 32 source 
and destination registers in parallel with the execution of an 
OP instruction. The multiplier is always active and 
generates a product on each cycle. If new inputs are not 
loaded during an instruction, the previous values are simply 
remultiplied. 

The control field can assume one of 15 combinations of subfields.
These subfields are specified in an instruction by the
appropriate mnemonic. The assembler configures the operation and
checks for legal combinations of specifiers. The subfields
control RAM operating modes, ROM base/index pointers, loop
counter decrementing, value/normalization register shifting,
transfer formatting, and the like.

A “load immediate data” instruction specifies the 24-bit 
mantissa and its destination register. Since an instruction 
word is only 32 bits wide, floating-point data must be 
loaded in two cycles. In this case, the eight-bit exponent is 
stored beforehand in a temporary register. 

The branch instruction type includes jump, conditional branch,
subroutine call, and return instructions. Branch instructions
employ a 5-bit condition field and a 13-bit next-address field. A
branch instruction can perform an internal data bus transfer
while it is executing regardless of the branch condition. Because
of the pipelining of the ASP, the instruction following a branch
instruction is always executed.

In general, the more pipelined a processor’s architecture, the
more difficult it becomes to branch and clear the contents of the
pipeline. The ASP uses a relatively simple three-stage pipeline:
an instruction fetch, execution, and result occur in three
successive cycles pipelined to yield one-cycle-per-instruction
throughput. The latency is three cycles, but this is usually
insignificant since many DSP programs are long and repetitive.

A side effect of this pipelined operation is that the instruction
immediately following a branch is always executed, regardless of
whether the branch condition was met. This occurs because the
pipeline has already prefetched the instruction following the
branch. One consequence of this side effect is that the
programmer will make a branch the next-to-last instruction in a
loop. In this way, he ensures that the last instruction of the
loop will always be executed. An example of this pipelining can
be seen in the code for a FIR filter (Figure 3).
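
The effect of the always-executed instruction after a branch can be seen in a toy simulation. The Python sketch below is a generic model of a delayed branch, not of the ASP's instruction set: the two operations ("work" and "loop") and the combined decrement-and-branch behavior are simplifying assumptions made for the example.

```python
def run(program, loop_count):
    """Run a tiny program in which the instruction after a branch always executes."""
    trace = []
    pc = 0
    pending = None  # branch target waiting for the following instruction to finish
    while pc < len(program):
        op, arg = program[pc]
        next_pc = pc + 1
        if pending is not None:
            # This instruction was already prefetched behind a taken branch.
            next_pc = pending
            pending = None
        if op == "work":
            trace.append(arg)
        elif op == "loop":
            loop_count -= 1
            if loop_count != 0:
                pending = arg
        pc = next_pc
    return trace


program = [
    ("work", "A"),  # loop body
    ("loop", 0),    # branch is the next-to-last instruction of the loop
    ("work", "B"),  # last instruction of the loop, executed on every pass
    ("work", "C"),  # first instruction after the loop
]
trace = run(program, 3)
```

With a loop count of 3 the trace is A B A B A B C: instruction B runs on every pass, including the final one where the branch is not taken, which is why the branch is placed next to last.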


Code examples 

Below, we describe in detail the implementation of some
frequently used DSP operations. We explain the algorithms for the
finite impulse response (FIR) filter, the biquad filter, and the
fast Fourier transform (FFT) and show µPD77230 source code for
each. In each case, the algorithms are implemented for
floating-point operations. However, with only slight modification
they can be performed at the same speed in a fixed-point
environment. (See benchmarks in Table 2.)


Initial Conditions:
  Loop Counter (LC) = Number of taps
  RAM0 contains the delay taps X(n-i)
  RAM1 contains the coefficients A(i)
  Base Pointer 0 (BP0) = First delay tap
  Base Pointer 1 (BP1) = First coefficient
  Working Register 0 (WR0) = New input sample

Note: Instructions are separated by horizontal lines. Multiple
entries represent subfields of a single instruction.

        MOV RAM0, WR0      Input sample stored in RAM0
        CLR WR0 ;          Clear working register for summation
                           output
        ----------------------------------------------------------
        MOV KLR1, RAM0 ;   Load multiplier with first sample and
                           coefficient
        ----------------------------------------------------------
        MOV TR, K          Save input sample in temporary register
        INCBP0             Move Base Pointer 0 to first delay tap
        INCBP1 ;           Move Base Pointer 1 to second coefficient
        ----------------------------------------------------------
START:                     Beginning of loop
        ADDF WR0, M        Add previous multiplier result to
                           summation
        MOV KLR1, RAM0 ;   Load next delay tap and coefficient to
                           multiplier
        ----------------------------------------------------------
        MOV RAM0, TR       Save previous delayed sample in current
                           tap
        DECLC ;            Decrement Loop Counter, zero skips next
                           instruction
        ----------------------------------------------------------
        JMP START ;        If Loop Counter not zero, loop back for
                           next tap. The next instruction will
                           always be executed even during branch,
                           due to pipelining
        ----------------------------------------------------------
        MOV TR, K          Retrieve current delay tap from
                           multiplier reg
        INCBP0             Move pointer to next delay tap
        INCBP1 ;           Move pointer to next coefficient

Figure 3. Code for the FIR filter. Example 1: repetitive loop
calculation.


Table 2. µPD77230 benchmarks.

  Operation                        Time
  Division                         4.8 µs
  Square root (Newton’s method)    6.0 µs
  SIN, COS (Taylor series)         10.8 µs
  ATAN (Maclaurin expansion)       40.0 µs
  Biquad filter (1 stage)          0.9 µs
  FIR filter (32 taps)             5.2 µs
  Complex FFT
    32 points                      150.0 µs
    512 points                     4.7 ms
    1024 points                    12.5 ms (uses external memory)

Figure 4. Flow diagram for the FIR filter.



Finite impulse response filter. The FIR filter is simply a 
sum of products with no feedback terms. The output of 
each sample is a weighted sum of the new input value and 
the n - 1 previous delayed samples (Figure 4). The weighted 
coefficient values are selected to produce a given frequency 
response for the digital filter. 

Speed and program space can be traded off in the implementation
of an FIR filter. In one algorithm, the basic loop moves the nth
sample and the nth coefficient to the multiplier inputs,
accumulates the previous multiplier result, replaces the
(n - 1)th delay tap with the nth tap, decrements a counter, and
then branches to the beginning. Although this implementation of
an FIR filter is extremely inefficient (it requires four cycles
per tap), it works in the general case. Figure 3 shows the code
for this implementation.
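
The general-case algorithm, in which every delayed sample is physically moved one tap along, can be modeled in a few lines of Python. This is a sketch of the arithmetic only, with hypothetical names; it does not model the multiplier pipeline or the pointer registers.

```python
def fir_step(coeffs, taps, sample):
    """Compute one FIR output, shifting the delay line by one position.

    taps[i] holds X(n-i) after the shift; coeffs[i] holds A(i).
    """
    taps.insert(0, sample)  # newest sample enters the first tap
    taps.pop()              # oldest sample falls off the end
    return sum(a * x for a, x in zip(coeffs, taps))


coeffs = [0.5, 0.25, 0.25]
taps = [0.0] * len(coeffs)
impulse_response = [fir_step(coeffs, taps, x) for x in [1.0, 0.0, 0.0, 0.0]]
```

Feeding a unit impulse through the filter returns the coefficients in order followed by zero, which is a quick check that the delay line moves one tap per sample.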

When n is a power of two, however, the ASP’s modulo count
capabilities can be employed. A circular buffer can be
implemented that simulates replacing the (n - 1)th tap with the
nth tap by restoring the pointer after the entire filter length
has been executed. Instead of physically saving the delayed
samples using a temporary register, the RAM pointers cycle
through the table. The modulo count causes a wrap-around,
eliminating the need to test for the end of the buffer.


Initial Conditions:
  Number of taps (N) must be a power of 2
  Modulo Counter is set to log2(N)
  RAM0 contains the delay taps X(n-i)
  RAM1 contains the coefficients A(i)
  Base Pointer 0 (BP0) = First delay tap
  Base Pointer 1 (BP1) = First coefficient
  Working Register 0 (WR0) = New input sample

Note: Instructions are separated by horizontal lines. Multiple
entries represent subfields of a single instruction.

        MOV RAM0, WR0      Place input sample in RAM0
        CLR WR0 ;          Clear Working Register 0 for summation
        ----------------------------------------------------------
        MOV KLR1, RAM0     Load multiplier with input sample and
                           coefficient
        INCBP0             Move Base Pointer 0 to first delay tap
        INCBP1 ;           Move Base Pointer 1 to second coefficient
        ----------------------------------------------------------
        Next instruction is repeated N-2 times, N = Number of taps
        ----------------------------------------------------------
        ADDF WR0, M        Add previous multiplier result to
                           summation
        MOV KLR1, RAM0     Load next delay tap and coefficient to
                           multiplier
        INCBP0             Move pointer to next delay tap
        INCBP1 ;           Move pointer to next coefficient
        ----------------------------------------------------------
        Repeat above instruction N-2 times
        ----------------------------------------------------------
        ADDF WR0, M        Add (N-1)th product, end condition
        MOV KLR1, RAM0     Load last sample and coefficient to
                           multiplier
        INCBP1 ;           Move pointer to first coefficient (modulo
                           wrap) but do not increment delay tap
                           pointer so that circular buffer can be
                           implemented
        ----------------------------------------------------------
        ADDF WR0, M ;      Add final product to accumulated sum in
                           WR0; next time through, taps will delay
                           by one

Figure 5. Code for the FIR filter. Example 2: straight in-line code.


Biquad difference equations:

  W[n] = X[n] - [B1 * W[n-1]] - [B2 * W[n-2]]
  Y[n] = W[n] + [A1 * W[n-1]] + [A2 * W[n-2]]

Figure 6. Flow diagram for the biquad filter.

If an application requires real-time speed, straight in-line
coding can be employed. This way of structuring a program is very
similar to a circular buffer except that it does not use a loop
counter and branch instruction, which are unnecessary. Figure 5
shows the straight in-line code for the FIR filter.
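
The circular-buffer idea can also be modeled in software. In the Python sketch below the class name and the bit-mask wrap are illustrative assumptions; the mask plays the role of the modulo count mode, so no sample is ever moved and no end-of-buffer test is needed.

```python
class CircularFir:
    """FIR filter whose delay line is a power-of-two circular buffer."""

    def __init__(self, coeffs):
        n = len(coeffs)
        if n == 0 or n & (n - 1):
            raise ValueError("number of taps must be a power of two")
        self.coeffs = list(coeffs)
        self.mask = n - 1          # wraps an index the way modulo counting does
        self.buf = [0.0] * n
        self.head = 0              # position of the newest sample

    def step(self, sample):
        # Step the pointer back one place instead of moving every sample.
        self.head = (self.head - 1) & self.mask
        self.buf[self.head] = sample
        acc = 0.0
        for i, a in enumerate(self.coeffs):
            acc += a * self.buf[(self.head + i) & self.mask]
        return acc


fir = CircularFir([1.0, 2.0, 3.0, 4.0])
response = [fir.step(x) for x in [1.0, 0.0, 0.0, 0.0, 0.0]]
```

The impulse response again reproduces the coefficients, so the result matches a filter that shifts its samples, while each new input costs one pointer update instead of a pass of data moves.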

Biquad filter. One of the most common operations for a digital
signal processor is the biquadratic filter. The biquad filter is
a two-tap FIR filter with feedback. The signal flow diagram for
this filter (Figure 6) shows that for each new input sample, four
multiplications, four additions, and two delays are needed to
generate each output value. The code for the biquad filter is
shown in Figure 7. The ASP can perform this filter in six cycles,
or 0.9 µs. Note that the ASP processes fixed-point and
floating-point numbers with the same speed. However, floating
point avoids the need for overflow detection coding, which can
slow the system.
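
The biquad difference equations given with Figure 6 translate directly into software. The Python sketch below follows those two equations term by term; the function name and the closure-based state are assumptions for the example, and the coefficient values in the usage lines are arbitrary.

```python
def make_biquad(a1, a2, b1, b2):
    """Return a function that filters one sample per call.

    W[n] = X[n] - B1*W[n-1] - B2*W[n-2]
    Y[n] = W[n] + A1*W[n-1] + A2*W[n-2]
    """
    w1 = 0.0  # W[n-1]
    w2 = 0.0  # W[n-2]

    def step(x):
        nonlocal w1, w2
        w0 = x - b1 * w1 - b2 * w2
        y = w0 + a1 * w1 + a2 * w2
        w2, w1 = w1, w0  # the two delays
        return y

    return step


stage = make_biquad(a1=0.5, a2=0.25, b1=-0.5, b2=0.0)
outputs = [stage(x) for x in [1.0, 0.0, 0.0]]
```

Each call performs the four multiplications, four additions, and two delay updates counted in the text. Sections can be cascaded by feeding the output of one `step` function into the next.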

The initial conditions for biquad filtering require the ROM
pointer (RP) to be set at the top of the coefficient table, the
RAM0 base pointer (BP0) to be set to the first delay tap, and the
new input sample to reside in working register 0 (WR0). The
result will be placed in WR1. One can cascade biquad sections by
either adding an instruction to move the result back to WR0 or
copying the biquad filtering routine and reversing WR0 and WR1.

The biquad algorithm can be rewritten to use RAM1 instead of the
fixed internal data ROM. If this is done, the coefficients can be
updated for applications such as adaptive filtering and
equalization.


Note: Instructions are separated by horizontal lines. Multiple
entries represent subfields of a single instruction.

        MOV LKR0, ROM      K = W(n-1), L = -B1 (Load Multiplier)
        CLR WR1            Clear WR1 for summation
        DECRP              RP points to -B2
        INCBP0 ;           Base Pointer moves to W(n-2)
        ----------------------------------------------------------
        ADDF WR0, M        WR0 = X(n) - (B1 * W(n-1))
        MOV LKR0, ROM      K = W(n-2), L = -B2
        DECRP ;            RP moves to A2 (BP0 still at W(n-2))
        ----------------------------------------------------------
        ADDF WR0, M        WR0 = X(n) - (B1 * W(n-1))
                           - (B2 * W(n-2)) = W(n) = New W(n-1)
        MOV LKR0, ROM      K = W(n-2), L = A2
        DECRP              RP moves to A1
        DECBP0 ;           BP0 moves to W(n-1)
        ----------------------------------------------------------
        ADDF WR1, M        WR1 = 0 + (A2 * W(n-2))
        MOV LKR0, ROM ;    K = W(n-1), L = A1
        ----------------------------------------------------------
        ADDF WR1, IB       WR1 = WR0 (New W(n-1)) + (A2 * W(n-2))
        MOV RAM0, WR0      Save New W(n-1) in place in RAM
        DECRP              RP points to top of next stage’s table
        INCBP0 ;           BP0 moves to W(n-2)
        ----------------------------------------------------------
        ADDF WR1, M        WR1 = New W(n-1) + (A2 * W(n-2))
                           + (A1 * Old W(n-1))
        MOV RAM0, K        Old W(n-1) stored to W(n-2)
        INCBP0 ;           BP0 moves to W(n-1) for next stage

Figure 7. Code for the biquad filter.


Figure 8. Flow diagram for the FFT butterfly.

Fast Fourier transform. Another common DSP operation is the fast
Fourier transform, or FFT, which is a mathematically efficient
method for performing the discrete Fourier transform. The basic
operation of the FFT is the “butterfly,” which consists of a
two-point FFT (Figure 8). Here, the input data (x, y) and output
data (X, Y) are complex. The FFT algorithm takes advantage of the
base and index pointers of the ASP, using the index register as
an offset to simplify addressing of the butterfly. It sets the
base pointer to the first value, x, and the index pointer to the
offset of the second value, y. Note that because the algorithm
employs decimation in time, the computed values of x and y can be
stored back in their original locations. By simply modifying the
base and index pointers, the algorithm can compute another
butterfly with the same instructions it used to compute the first
butterfly. Figure 9 shows the code for the FFT butterfly.


Initial Conditions:
  RAM0 contains real part of input data
  RAM1 contains imaginary part of input data
  Base Pointer 0 (BP0) = 1H (Hex)
  Base Pointer 1 (BP1) = 1H
  Index Register 0 (IR0) = 10H
  Index Register 1 (IR1) = 10H
  ROM Pointer (RP) = 2H
  Special ROM Pointer Increment = 2 (used with INCBRP mnemonic)

        CLR WR0            Working Register 0 (WR0) clear
        MOV LKR0, ROM      K <- y_r, L <- cos(z)
        RPINC              ROM Pointer = 3H
        SPCBI1 ;           RAM1 uses Base1 + Index1 (RAM1 = 11H)
        ----------------------------------------------------------
        ADDF WR0, M        WR0 = cos(z) * y_r
        MOV KLR1, ROM      K <- sin(z), L <- y_i
        SPCBP0 ;           RAM0 uses Base0 (RAM0 = 1H)
        ----------------------------------------------------------
        ADDF WR0, M        WR0 = cos(z) * y_r + sin(z) * y_i
        MOV WR1, RAM0      WR1 <- x_r
        SPCBP1 ;           RAM1 uses Base1 (RAM1 = 1H)
        ----------------------------------------------------------
        ADDF WR0, RAM0     WR0 = x_r + cos(z) * y_r + sin(z) * y_i
        MOV WR3, RAM1 ;    WR3 <- x_i
        ----------------------------------------------------------
        SUBF WR1, IB       WR1 = x_r - (cos(z) * y_r + sin(z) * y_i)
        MOV NON, WR0       Use value on internal data bus (IB)
        SPCBI0 ;           RAM0 uses Base0 + Index0 (RAM0 = 11H)
        ----------------------------------------------------------
        CLR WR2            Clear Working Register 2
        MOV LKR0, ROM      K <- sin(z), L <- y_r
        RPDEC              ROM Pointer = 2H
        SPCBI1 ;           RAM1 uses Base1 + Index1 (RAM1 = 11H)
        ----------------------------------------------------------
        SUBF WR2, M        WR2 = -sin(z) * y_r
        MOV KLR1, ROM      K <- cos(z), L <- y_i
        SPCBP0 ;           RAM0 uses Base0 (RAM0 = 1H)
        ----------------------------------------------------------
        ADDF WR2, M        WR2 = -sin(z) * y_r + cos(z) * y_i
        MOV RAM0, WR0      RAM0(1H) <- WR0 (output real part of X)
        SPCBP1             RAM1 uses Base1 (RAM1 = 1H)
        SPCBI0 ;           RAM0 uses Base0 + Index0 (RAM0 = 11H)
        ----------------------------------------------------------
        ADDF WR2, RAM1     WR2 = x_i - sin(z) * y_r + cos(z) * y_i
        MOV RAM0, WR1      RAM0(11H) <- WR1 (output real part of Y)
        SPCBP0 ;           RAM0 uses Base0 (RAM0 = 1H)
        ----------------------------------------------------------
        SUBF WR3, IB       WR3 = x_i - (-sin(z) * y_r + cos(z) * y_i)
        MOV NON, WR2       Use value on internal data bus (IB)
        INCBRP ;           Increment ROM Pointer by 2 (uses special
                           2^n)
        ----------------------------------------------------------
        MOV RAM1, WR2      RAM1(1H) <- WR2 (output imaginary part
                           of X)
        SPCBI1 ;           RAM1 uses Base1 + Index1 (RAM1 = 11H)
        ----------------------------------------------------------
        MOV RAM1, WR3      RAM1(11H) <- WR3 (output imaginary part
                           of Y)
        SPCBI0             RAM0 uses Base0 + Index0 (RAM0 = 11H)
        INCBP0             Increment Base Pointer 0 (BP0 = 2H)
        INCBP1 ;           Increment Base Pointer 1 (BP1 = 2H)

Figure 9. Code for the FFT decimation-in-time butterfly.

In an FFT on the ASP, the real part of a value is stored in RAM0
while the imaginary part is stored in RAM1. The “twiddle factors”
are known constants and therefore are stored in the internal
coefficient ROM. The data are represented as follows:


Input data:

  x = x_r + j x_i  and  y = y_r + j y_i
  (r = real, i = imaginary)

Twiddle factor:

  W_k = cos(z) - j sin(z)
  (where z is a function of the order of the FFT performed)

Output data:

  X = X_r + j X_i  and  Y = Y_r + j Y_i

where

  X_r = x_r + cos(z) * y_r + sin(z) * y_i
  X_i = x_i + cos(z) * y_i - sin(z) * y_r
  Y_r = x_r - cos(z) * y_r - sin(z) * y_i
  Y_i = x_i - cos(z) * y_i + sin(z) * y_r
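
The four output equations can be checked against ordinary complex arithmetic. The Python sketch below is only an illustration with a hypothetical function name; it computes the butterfly from real and imaginary parts exactly as the equations are written, and the usage lines compare the result with X = x + W*y and Y = x - W*y for the twiddle factor W = cos(z) - j sin(z).

```python
import math


def butterfly(x, y, z):
    """Decimation-in-time butterfly written out in real arithmetic."""
    c, s = math.cos(z), math.sin(z)
    t_r = c * y.real + s * y.imag   # real part of W * y
    t_i = c * y.imag - s * y.real   # imaginary part of W * y
    big_x = complex(x.real + t_r, x.imag + t_i)
    big_y = complex(x.real - t_r, x.imag - t_i)
    return big_x, big_y


x, y, z = 1 + 2j, 3 - 1j, 0.3
w = complex(math.cos(z), -math.sin(z))
big_x, big_y = butterfly(x, y, z)
```

The two products t_r and t_i are shared between the X and Y outputs, which is why a butterfly needs four real multiplications rather than eight.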

In our example the butterfly takes 12 instruction cycles. In
general, an n-point FFT will require n/2 butterflies per stage
and log2(n) stages. At 150 ns per instruction, a 512-point
complex FFT should take

  (512/2) butterflies/stage * log2(512) stages
    * 12 cycles * 150 ns
  = 256 * 9 * 12 * 150 ns
  = 4.1 ms.


The actual benchmark is 4.7 ms, including the input and 
output of the 512 complex values and the modification of 
the base and index pointers. 
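
The same estimate can be written as a small Python function, which makes it easy to try other transform sizes. The function and constant names are assumptions for the example; the estimate covers butterfly cycles only and ignores the input/output and pointer overhead mentioned above.

```python
CYCLE_NS = 150             # instruction cycle time
CYCLES_PER_BUTTERFLY = 12  # from the butterfly code of Figure 9


def fft_butterfly_time_ms(n):
    """Estimate the butterfly time of an n-point complex FFT in milliseconds."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two")
    stages = n.bit_length() - 1          # log2(n)
    butterflies = (n // 2) * stages
    return butterflies * CYCLES_PER_BUTTERFLY * CYCLE_NS / 1e6
```

For n = 512 the function returns about 4.15 ms, and for n = 32 about 0.144 ms, slightly below the 150-µs benchmark in Table 2, consistent with the benchmarks including overhead that the estimate omits.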

The µPD77230 ASP is supported by several development
tools. A relocatable assembler is currently available and


Software copyright developments

Structure of computer programs protected by copyright; implications for “clean rooms”; 
microcode protectable by copyright 


As reported in the last issue of IEEE Micro (p. 66), the US
appellate court in Philadelphia held that copyright protection of
a computer program goes beyond just the lines of code contained
in the computer program. It also covers some aspects of the
“structure” of the program.

In Whelan Associates, Inc. v. Jaslow Dental Laboratory, Inc., the
trial court found the defendant guilty of copyright infringement
because the visual screens displayed by the two parties’ programs
were almost identical in format, and naive users and prospective
customers could not tell the difference between what the two
systems did. 1 The court of appeals has now upheld this judgment,
but on the basis of somewhat different reasoning. The bottom
line, however, is that the plaintiff/copyright owner in this case
is now adjudged to be entitled to 75 years of exclusive right to
market computer programs having the structure (however that term
is defined) of the registered program.

There was no copying of code in this case. Among other things,
the computer programs were written in different languages, so
that there could not be any byte-for-byte or other literal copy
of the code. The original program was written in EDL, or Event
Driven Language, and the infringing program in Basic. It is
unclear from the opinions whether it is possible to “translate”
an EDL program into Basic, or whether instead the program must be
greatly rewritten to port it. In any event, it appears to have
been conceded or assumed here that the code of the infringing
program was not a 1:1 mathematical transform of the original
code.

The court of appeals did not explain what it meant by a computer
program’s structure in general or what aspects of program
structure are to be protected against a second comer. In its
opinion the court did give examples, however, of objectionable
copying of structure. A prominent example was the “file
structure,” or what we might term the selection of fields (for
database types, attributes for more theoretical computer science
types). In the case of what seems to be an invoicing program, the
defendant appears shamelessly to have copied from the plaintiff
the plaintiff’s original selection of all of the following
fields:

• description of item sold,

• number of items sold (n),

• unit price (p),

• extension (e = n * p), and

• total amount to be billed (sum of e’s).

In addition the defendant imitated plaintiff’s practice of
setting a flag in the record after invoicing a customer, to
prevent reinvoicing the customer for the same bill.

In these blatant circumstances the court of appeals felt that it
had no alternative but to hold the defendant guilty of copyright
infringement for copying the plaintiff’s creative work. Although
the court did not mention it, probably the copying of format went
even farther and extended to such plagiarism of structure as
putting the customer’s name and address on the invoice and the
date on which the invoice was prepared.

What the court did, as this example shows, should give
considerable pause to would-be copyists of commercially
successful computer programs. What the court said by way of
explanation should slow them down even more.

Before the court could hold the file structures of the
plaintiff’s computer program to be protected by the copyright
registration here, the court had to assure itself that it was
protecting the “expression” rather than “idea” of the program.
This is a fundamental (perhaps the fundamental) principle of
copyright law: Copyrights protect authors’ expressions of ideas,
not their ideas themselves.

Therefore, despite the absence of any examination of a work for
technical advance or creative merit (as occurs under the patent
system), there should be no public concern over proliferation of
monopoly because of grants of copyright protection. Such
monopolies cannot occur, because everybody else is free to create
and secure a copyright in his own individual expression of the
same idea, and an infinity of different possible expressions of
any idea exists. 2 Ideas remain freely available to everyone, and
they are not locked up by the copyright laws; only authors’
particular personal expressions of ideas are reserved by
copyrights.

The court conceded that “it is frequently difficult to
distinguish the idea from the expression” in a work of
authorship, or even “elusive.” But the court found that it was
easy to do so in the case of a computer program, by means of a
new legal test that the court had devised. Any choice in writing
the program that is not necessary to achieve the purpose of the
program is expression; any choice necessary to achieve the
purpose is idea.

This “rule,” of course, simply changes the discussion from what
is idea and what is expression to what is necessary and what is
not necessary to achieve the purpose of the program, so that the
definition of the two terms, “necessary” and “purpose,” must be
addressed. For some reason the court felt that thus shifting the
focus of discussion would further the analysis.

Of the two terms, purpose is by far the less objectively
definable; it is indeed the pea under the shell at the legal
carnival. Purpose, like idea, can be defined narrowly or broadly;
it is an accordionlike concept. It can be a precise species or
the most indefinite genus.

For example, what is this yellow-colored object I hold in my
hand? Is it an instance of tangible matter, organic matter,
vegetable matter, a fruit, a citrus fruit, a lemon, or what?
Suppose that it matters, in the context of a court’s decision
whether Peter must pay money to Paul, whether the object that you
hold in your hand is the “same” object as mine. Must Peter pay if
you have a goldfish, a squash, a banana, an apple, a grapefruit,
or another lemon? And if it is another lemon, what if one is a
California lemon and the other a Florida lemon, a lemon from the
lemon tree adjacent to mine, or from the next branch on the same
tree? And if I hand you my lemon, is it still the same lemon a
few minutes or days later? Suppose, further, that you enhance,
adapt, debug, or otherwise modify the lemon?

Now I suppose that you can say that there are practical answers
to these metaphysical questions, so that whether you hold a fish,
a squash, a banana, or a debugged and enhanced lemon, the answer
to the question of what our held objects are (or to what category
we will assign them in our discussion) will depend on the context
in which the question is asked. Are we concerned with
paperweights, missiles, food, dessert, main course, something to
put into iced tea, or what?

But that is not how the court of appeals approached the question
of defining the purpose of these programs. The court said that
the application of its newly proposed rule of law presented no
difficulties in the present case, because the true purpose of the
parties’ computer programs was so clear. The purpose was to “aid
in the business operations of a dental laboratory,” or to put it
with even greater precision “simply to run a dental laboratory in
an efficient way.” The structure of the computer program “was not
essential to that task.” Accordingly, the structure was
expression rather than idea, and the defendant thus committed
copyright infringement by copying it as described above.

I t requires little reflection to conclude 
that when the purpose of a computer 
program is defined that generically 
few or no features of the structure of 
computer programs will ever be neces¬ 
sary to achieve that purpose. Indeed, the 
purpose test becomes a sham or charade, 
for it inevitably leads to the same result: 
guilty as charged. The purpose test is not 
a tool for analysis; it is a ritual incanta¬ 
tion that could just as well be omitted 
for all the difference it makes in the out¬ 
come. When purpose is stated broadly 
enough, nothing is then essential to 
achieve the purpose. 

Thus, under the Whelan court’s concept
of purpose, not even mailing an
itemized invoice is “essential” to running
a dental laboratory. Each item could 
have been sold COD or by a cash or 
credit-card sale. Or a messenger could be 
sent to the customer to make offers he 
could not refuse, such as, pay for the 
merchandise right now or have your 
kneecap broken. That is certainly one 
concept of efficiency. 

Even if the purpose of the computer 
program had been defined so that it was 
not unacceptably broad, that would not 
have made the test of being necessary or 
essential to achieve the purpose a correct 
test. For example, suppose we instead 
consider the purpose of the invoicing 
program to be to make it possible for 
naive dental-lab users to use a microcom¬ 
puter in preparing invoices without hav¬ 
ing to understand much about computers 
or software. Something could both be 
unnecessary to achieve that purpose and 
be an idea that the copyright laws do not 
(and should not) permit anyone to own. 
Possible examples are writing the pro¬ 
gram in C to make it run fast, using 
friendly icons, having the program 
driven by a menu in which the user 
moves the cursor to a choice and presses 
<Enter> instead of typing in an alphanumeric
code shown on the screen, and
using the address on the invoice to 
prepare a mailing label with the same 
address. 

There is simply no logical implication 
that because something is not necessary 
other people should refrain from using it 
lest they be liable for copyright infringe¬ 
ment. To be sure, probably nothing that 
is necessary should be protected by a 
copyright, as contrasted with a patent, 
but the converse is not true. The public 
domain includes many “unnecessary” 
things. Under a free-enterprise system 
the government does not prohibit busi¬ 
nessmen from doing particular things 
just because someone else did them first; 
such a government prohibits business 
from free exercise of volition only when 
there is a stated reason for the compul¬ 
sion, usually the realization of some 
recognized public purpose. 

The analysis in the Whelan opinion
is so bad that it is difficult to 
know where to start criticizing it. 
Indeed, the court seems to have only one 
thing right, which I have not mentioned 
so far but should. The court was con¬ 
cerned with evidence that coding in 
general is, and in the case of these par¬ 
ticular programs was, responsible for 
only a small fraction of the total time 
and cost of developing the computer
program, perhaps 20 percent. Organizing
the dataflow, partitioning out
repetitive subroutines, and other non¬ 
code aspects of the programs accounted 
for “a tremendous amount of time” in 
developing the plaintiff’s computer pro¬ 
grams. The court indicated that the non¬ 
code aspects of the computer programs 
may have embodied most of the creativi¬ 
ty and commercial value of the pro¬ 
grams. Clearly, this view of the facts 
caused the court to conclude that these 
noncode aspects should be legally pro¬ 
tected and that the defendant should be 
made liable to the plaintiff for appropri¬ 
ating them. 

The factual premise is doubtless defen¬ 
sible, perhaps on balance the better view 
when properly refined, but the conclu¬ 
sion of copyright infringement is a non 
sequitur and the court’s legal analysis is 
still bad. That program structure (what¬ 
ever that term means) is commercially 
valuable, maybe even far more valuable 
than the particular coding, does not 
mean that the copyright statute was in¬ 
tended to or should cover that structure. 
Perhaps something should protect struc¬ 
ture, but probably not something just 
like the copyright laws, and certainly not 
something that protects the kind of 
structure on which the court fastened its 
attention. 

Any system that protects things like 
the field selection of the invoicing pro¬ 
gram described above is a terrible idea. 
Use of those fields is either already part 
of the public domain or a trivial varia¬ 
tion on it, and everybody should be free 
to use it commercially; no one should 
have the right to prevent competitors 
from “appropriating” that kind of structure.
Perhaps field selection in a database
system should never be protectable.

Other noncode aspects of software, 
however, may well be ideas that deserve 
some sort of protection. Possible ex¬ 
amples are algorithms, flowcharts, in¬ 
struction sets, languages, and perhaps 
metaphors and icons. But their “misap¬ 
propriation” does not call for 75 years 
of injunctions, criminal penalties, and all 
the rest of the copyright arsenal. A more 
modest proposal would be in order. 3 

The Whelan court extrapolated from 
a probable social-economic need to a 
solution of its own devising. Courts are 
not suited for that task, and it is prob¬ 
ably beyond the scope of the role as¬ 
signed to courts under our system of 
government. In this case, moreover, the 
court’s solution was poorly conceived 
and likely counterproductive as well. 

The Whelan decision may have
serious implications for the 
“clean-room” theory of legally 
writing a BIOS or other software. The 
clean room is an expensive procedure for 
reinventing the wheel. Under this theory 
shadow or form is exalted over sub¬ 
stance, because it is considered that the 
applicable copyright law is all shadow 
and no substance. The clean-room procedure
for writing competitive software
(e.g., an IBM PC clone’s BIOS, a dBase 
substitute, a Lotus 1-2-3 clone) works as 
follows: 

• Teams A and B are set up to develop 
an emulator of the target software. Team 
A consists of ordinary systems analysts. 
Team B consists of computer program¬ 
mers who have never seen the code of 
the target software. (It is said that they 
must come via spaceship directly from a 
distant planet.) 

• Team A begins to analyze the target 
software. Team B is locked up in the 
clean room and protected from contami¬ 
nation by contact with Team A or the 
target software. 

• Team A studies the target software 
in detail. It determines the specifications. 
It may disassemble the code, reverse en¬ 
gineer the flowchart, and otherwise learn 
how the computer program works. 

• Team A writes a report describing 
the specifications and requirements of 
the computer program but providing no 
code. The report is supposed to be all 
idea and no expression. 

• The report is passed to Team B via 
a porthole into the locked clean room. 
Team B then writes the code called for 
in the report. Only then is Team B 
allowed to leave the clean room and be¬ 
come contaminated. 

• Team B never had access to the ac¬
tual code, either directly or by being told
what it was by Team A. Team B has had
access only to the ideas of the target soft¬
ware. The conclusion sought to be drawn
is that any parallelism in code is either
purely coincidental or the effect on the
code of the ideas passed through the
porthole (Team A’s report).

Whether or not the clean-room proce¬
dure is a silly charade, many lawyers and
MBAs swore by it. It was the magic for¬
mula for avoiding liability for copyright
infringement. But under the Whelan
rule, what can Team A now pass to
Team B via the porthole of the clean
room? It would now seem that anything
of value that is passed will be expression,
since idea now includes only the most
general and abstract formulation of the
purpose of the computer program; the
“unnecessary” rest of the description of
the target computer program and the
least detail beyond the abstract purpose
will be expression. Well, just as one good
turn deserves another, perhaps one silly
concept deserves its comeuppance from
another.


Drawing the line

The following comments are offered by Michael A. Dailey and Henry W. Jones III,
two Atlanta attorneys associated with Microstuf Inc. in its suit against SoftKlone
Distributing Corp. over the latter’s Mirror clone of Microstuf’s Crosstalk communica¬
tions program. The suit is now pending in the federal district court in Atlanta. (Ed.
note: See “MicroReview” in this issue for Dave Hannum’s review of Mirror.)

“Clone makers” have tried to seize upon the recognized principle of US copyright
law that blank forms are not proper subjects for copyright protection. The clone
makers have tried to exploit this principle as a justification for copying the formats of
screen displays and other user interfaces on which software innovators have lavished
vast sums of money and hours of toil. But the clone makers disregard the notable
exception to this rule, which holds that forms that not only record information but
also convey information are worthy of copyright protection.

As applied, this rule has protected such things as hospital examination forms, legal
forms, forms for gasoline station account books, and answer sheets for psychological
tests. These precedents are instructive in showing that computer screen formats may
also be protected by copyright. And just as the courts have protected the “look and
feel” of greeting cards, video games, and other pictorial or audiovisual works, we
believe that so too should they protect the look and feel of user interfaces.

That theory is being tested in Microstuf’s pending case against SoftKlone over the
Crosstalk and Mirror communications programs. Mirror is an emulator of Crosstalk
XVI, version 3.6. A comparison of the two programs’ screens (see Figure 1) reveals
identically expressed communications parameters, as well as Filter, Key, and Send
control settings. Mirror also duplicates all 87 commands used in Crosstalk and the
biliteral alphanumerics used to abbreviate them for user keyboard entry.

[Figure 1. Screens of the Crosstalk and Mirror programs.]

We feel that this case, and other cases like it pending in other courts around the
country, will determine whether software clone makers will be free to copy the
creative work of software originators in devising screens and other user interfaces. We
consider the Whelan case a significant pointer toward increased judicial recognition of
the importance of protecting the creativity of originators and in requiring clone
makers to adjust their product development strategies accordingly. For example, we
believe that the Whelan decision indicates that for Mirror to mimic program features
of Crosstalk that are not essential to the proper functioning of a communications pro¬
gram is a copyright infringement, and therefore it should be enjoined by the courts
and made subject to damages.

This will not impair competition in the delivery of software to the public, because
there are many alternative means of expressing the necessary commands and other
features of such programs, and the public can just as readily learn one as the other.
For a second comer to take a free ride on the time, effort, and expense of a software
innovator in educating the public to the usefulness of a new type of program and in
how to use its user interface is simply misappropriation of the business values of the
innovator. That kind of reaping where another has sown will and should be stopped
by the courts, as the Whelan court recognized and as its decision provided.

M.A.D. and H.W.J. III


Reader comment on these views is welcome. Do IEEE Micro's readers feel that it is
more important financially to encourage and thus to stimulate software innovation or
that it is more important to encourage price competition by clone makers? Who
“owns” and is entitled to profit from the users’ efforts in having learned a command
set or interface? Is it the creator of the command set or interface, the user, or any¬
body who wants to come along and exploit the user investment in learning how to
use the interface?

Should anyone who tries to use F1 for anything but calling up the help menu be
suppressed? Should anyone who claims a monopoly on the use of F1 for help be
suppressed? Should we all agree to use F1 for help, F9 for recalculate, Q for quit,
Y/N for yes or no?, and so on, and do so freely? Or should the first user be the only
user, or at least be compensated by all later users?

In the case of the biliteral terms seen in Figure 1, should the second comer be forc¬
ed to devise his own two-letter combinations for BReak, DEbug, Dir, EDit, NUmber,
PArity, QUit? Should he come up with new names for those commands, so that
other biliteral terms will seem appropriate? Are there practical alternatives for the ex¬
amples given? If there are, would it really be a good idea to introduce them?

What about more subtle things, such as whoever thought of using labels (e.g.,
:label) instead of absolute line numbers to send a GOTO to the right line for a sub¬
routine or branch? Should we all pay him or her, or just let him or her bask, unpaid,
in the satisfaction of having benefitted humanity?

How far should this reasoning go? Can we draw any line sensibly between accept¬
able and unacceptable copying by emulators/clone makers?


Can the legal system deal sensibly with this sort of thing?
    If yes, by copyright?
        If yes, GOTO :Buy
        If no, GOTO :How
    If no, can something else?
        If yes, GOTO :What
        If no, GOTO :End

:Buy
    Buy gold brick, Brooklyn Bridge, etc., from nice man
    GOTO :End

:How
    How?
    Write if you get work
    Good luck
    GOTO :End

:What
    What? And
    GOTO :How

:End


A federal trial court in San Jose has 
held that microcode is copyrightable.
NEC had sued Intel for a
judgment declaring either that the 8086/8088
microcode was uncopyrightable or
that NEC did not infringe it by market¬ 
ing the V20 microprocessor chip. Intel 
then countersued for copyright infringe¬ 
ment. 4 

The court held that the microprograms 
stored in the microcode ROMs of the 
8086 and 8088 were computer programs 
under the Copyright Act, and literary 
works. It said that writing microcode 
was a “creative endeavor” and that the 
programming methodology was indistin¬ 
guishable from that employed in creating 
other types of computer programs. It 
also ruled that the fact that microcode 
has a function or utilitarian purpose does 
not make microcode uncopyrightable. 

The court left for a future date its 
determination whether the particular 
NEC microcode in fact infringed the 
copyright in the Intel microcode. NEC 
applied to the court and was given per¬ 
mission “to present evidence of ‘clean- 
room’ creation of [its] microcode.” 
(Because the San Jose court is not under 
the authority of the Whelan court of ap¬ 
peals, it is not obliged to follow the im¬ 
plications or even the holding of that 
decision.) 

It is unclear if the court has left open 
the question of whether the NEC micro¬ 
code is noninfringing because (according 
to NEC) it copies only “functional” 
parts of the Intel microcode. The court 
held that “the function performed by 


December 1986 


77 








Micro Law 


[Intel’s] 8086/8088 microprograms does 
not affect their status as copyrightable 
subject matter.” But whether the copy¬ 
right in utilitarian works extends to and 
protects their functional aspects is a 
separate question from whether the ex¬ 
istence of such aspects makes the works 
uncopyrightable. There have been cases 
in which the courts held that it was not 
copyright infringement to copy the func¬ 
tional aspects of utilitarian works, while 
in other cases courts have held that 
utilitarian works were uncopyrightable 
because of their functionality. 

Menus were protected in a recent
“look-and-feel” copyright infringement
decision from the
San Francisco federal trial court. Broderbund
Software sued Unison World (now
Kyocera Unison) for copying the screens 
of Broderbund’s Print Shop display 
graphics program. Unison’s Print Master 
was the product of a fallen-apart joint 
venture in porting Print Shop from the 
Apple to the IBM PC. 

According to the court, the infringing 
screens tracked the appearance of the 
originals in many arbitrary, nonfunc¬ 
tional aspects, such as relative size of 
characters on the display, layout, and 
choice of wording for phrases. The court 
concluded that an “ordinary observer 
could hardly avoid being struck by the 
eerie resemblance between the screens of 
the two programs,” since “the sequence 
of screens and the choices presented, the 
layout of the screens, and the method of 
feedback to the user are all substantially 
similar.” The court gave a list of paral¬ 
lels, some of which seem to show noth¬ 
ing (for example, that both products re¬ 
quire the user to create the front of a 
greeting card before the second page) 
and others of which seem to show copying
of arbitrary details (for example,
division of a page into 13 sectors). 

The case differs from other look-and- 
feel cases in that the basis for the court’s 
judgment of copyright infringement was 
not a “literary” or textual copyright in 
the code, but a pictorial type of copy¬ 
right in the screens themselves. The 
screens were registered as audiovisual 
works, so that the look and feel in con¬ 
troversy was that of the pictures as pic¬ 
tures rather than that of the computer 
program as such. That is an important 
difference. It is generally recognized that 
pictures may have a difficult-to-articulate 
but nonetheless protectible look and feel. 
It is more open to question, however, 
that a wholly symbolic work has a protectible
look and feel that is anything
other than its unprotectible concept or its 
“ideas.” 

At first blush the copying here seems 
excessive and unnecessary. It does not 
seem to be dictated by a need to conform 
to user habit or by some other functional 
consideration. However, some industry 
members claim that the decision will 
adversely affect the development of 
enhanced competitive products. 


In summary, in the past few months
there has been a considerable expan¬ 
sion of the scope given copyrights in 
protecting various noncode aspects of 
computer programs. 5 In some instances, 
such as the copying of screens copy¬ 
righted for their pictorial content, the de¬ 
velopment seems incremental and unob¬ 
jectionable. In other cases, however, the 
result seems to be that patentlike protec¬ 
tion of functionally important or public- 
domain subject matter is being awarded, 
merely on the basis of a copyright 
registration without any examination by 
an expert body, or even a court, into 
whether the work displays enough 
technical advance or merit to justify the 
award. The result may be to deprive 
other software developers of the privilege 
of using things (1) that contribute to 
making their own technological ad¬ 
vances, to the public benefit, and (2) in 
which they may be properly entitled to 
share, unless and until the first comer is 
able to obtain a patent on the feature. 




The courts protecting look and feel or 
other noncode aspects of computer pro¬ 
grams have responded to a need that 
they perceived to protect commercial 
values of software innovators. They have 
been convinced, probably rightly, that a 
major portion of the time, effort, and 
expense that goes into developing a com¬ 
puter program concerns noncode aspects 
of the program. These aspects may include
user interfaces, menu choices, program
metaphors (and icons, such as the
famous or infamous file-deletion garbage 
can), and other things termed “struc¬ 
ture.” The courts have also been per¬ 
suaded that protecting these things under 
the copyright law will promote techno¬ 
logical progress in software, to the 
benefit of the public, and that refusing 
to allow such protection will discourage 
investment in software innovation. The 
empirical evidence for this has been 
slender, and it is unclear whether the net 
effect of applying copyright law in this 
manner will be a plus or a minus on soft¬ 
ware progress. 

Applying copyright law is the only 
available quick patch, for no other con¬ 
venient federal system exists right now. 6 
Many proponents of such protection are 
unwilling to await legislative action, for 
it may never come or they may go bank¬ 
rupt as a result of competition from 
“software clone makers” in the mean¬ 
time. Clone makers, of course, take a 
different view; they say that this patch is 
a kludge. So, too, may more disinter¬ 
ested observers. The following is a quo¬ 
tation from a recent essay by Congress¬ 
man Robert Kastenmeier, the chairman 
of the House subcommittee responsible 
for intellectual property matters: 

In studying the problem of how best 
to protect semiconductor chip prod¬ 
ucts and in devising a legislative 
scheme that solves the problem and 
also promotes the public interest, I 
found the study and the solution to be 
a paradigm of the industrial property 
protection problem for all new tech¬ 
nology at the end of the Twentieth 
Century and at the rise of what may 
be a new information society. One of 
the things that I believe we learned 
was that the era of “shoehorning,” 
or of pouring new wine into old 
legislative bottles, should end. We 
learned...that different bodies of in¬ 
tellectual property law strike different 
respective equilibrium points for bal¬ 
ancing the interests and values at 
stake, that what is an acceptable or 
desirable balance of interests for 
authors and artists is not necessarily 
acceptable in the case of industrial 
property products, and that “it would 
be pure serendipity for a law designed 
to deal with literary and artistic rights 
to realize the needs of new technology.”
We are no longer so blindly self-
confident as to expect that such a
serendipitous result will automatically
occur.... 7 


December 1986 


Department 


MicroStandards 

Michael Smolin/Smolin & Associates/3428 Greer Road, Palo Alto, CA 94303

Publish and/or Perish (Or, Who Wants To Use a Trial-Use Standard?) 


In the annals of the Computer Society
lie numerous stories of the attempts 
to publish drafts of proposed stan¬ 
dards. Few of these attempts were suc¬ 
cessful—most were (and are) not. The 
arguments, pro and con, about publish¬ 
ing draft standards have taken on 
religious overtones. Adherents to each
argument may even refuse to see much, or
any, value in the other arguments.

Typically, a working group in the pro¬ 
cess of developing a standard needs to 
publish a draft to solicit comments from 
a broad base of peers. Standards ad¬ 
ministrators in the IEEE and especially 
in the Computer Society fear that, no 
matter how disclaimers are worded, 
someone will insist on mistaking the pub¬ 
lished draft for an adopted standard. 

Let’s examine the positions with a 
hypothetical dialogue... 

Standards developer (SD): The most 
important feature of IEEE standards is 
that they are standards of consensus. We 
must see to it that “every attempt is 
made to involve all interests in the activi¬ 
ty” so that “it can be presumed that the 
document represents a consensus of all 
interests concerned with the scope of the 
standard.” 1 

The working group developing the 
standard and the sponsoring technical 
committee often feel that true consensus 
requires broadly circulating the draft to 
stimulate comments from all facets of 
the interested parties. This is done best 
by printing the draft under consideration 
in a widely read professional publication 
such as IEEE Micro or Computer. 

Standards Administrator (SA): There 
are other ways to get the wide distribu¬ 
tion you feel you need—without publish¬ 
ing the draft of a proposed standard. 

You can publish articles about the draft 
and about the working group’s resolu¬ 
tion of opposing interests. Or, you can 
recommend that the proposed draft be 
adopted as a trial-use standard. 

A trial-use standard has a two-year life 
span. During that time, it will be treated 
by the IEEE as a true standard. It will be 
published and offered for sale by both 
the IEEE and the Computer Society. 
Comments received by the working
group during the trial-use period represent
the public comments that you need
to achieve consensus. You then can re¬ 
vise your draft, reballot it, and resubmit 
it—this time for adoption as a full-use 
standard. 

SD: You mean that by accepting a 
two-year trial-use period as a delay, we 
can go for a full standard without really 
achieving consensus—after all, few in the 
microcomputer area would bother imple¬ 
menting or examining in detail a “trial- 
use” document? Its very name implies 
that it likely will change just about the 
same time that a new and complying 
product could get to market. Note also 
that IEEE standards go on to become 
ANSI standards and often become 
international standards. 

About an article on a draft—just who 
will write it? We have already devoted 
hundreds of our volunteered hours to 
writing the draft. Now you also want us 
to write an article about it for publica¬ 
tion. My department’s budget doesn’t 
stretch that far. I still have to do my 
regular work and satisfy my managers. 

SA: There is great concern that a pub¬ 
lished draft might be mistaken for an ap¬ 
proved standard. This seems to happen 
even when care is taken to include dis¬ 
claimers, expiration dates, etc. Official 
policy about publication of drafts of 
standards is that “the practice is
deprecated by the Standards Board.” 2 

SD: There is another reason that we 
feel publication of drafts of standards 
should be unhindered—FAIRNESS. 
Working groups often have few mem¬ 
bers—perhaps a dozen or so, commonly 
only a handful of participating members. 
Now, I mean members of all categories, 
including nonparticipating observers. 
These members become an informed 
elite who have an information advantage 
over others. This translates into unfair 
advantage in marketing and technology. 
If the IEEE and the Computer Society 
truly serve their professional membership,
they should see that all of these standards
development activities get widespread
dissemination, including the publication
of drafts. It’s only fair.


SA: Fairness as you see it may be a 
luxury that we cannot afford. There is 
no requirement to force the membership 
to be knowledgeable about standards de¬ 
velopments. The publishing of additional 
hundreds of pages each year in society 
magazines is a cost not warranted by 
member interest. 

SD: Well, then let us submit the draft 
for publication in other trade journals 
after giving the society’s publications the 
first right of refusal. Publishing the draft 
also updates users about the state of the 
development of the standard. Many in¬ 
dividuals working from third-hand and 
out-of-date information do not know 
how the specifications have been 
changed over several drafts. We do a 
disservice by not explicitly telling them 
about the changes. 

SA: You can certainly supply them 
with a current copy of the draft as a 
working document, upon request. There 
is no obligation to do that, but each 
project may (with approval) supply 
copies of its documents and charge for
that service. 

Well, working professional, what
do you think? Does the publica¬ 
tion of draft standards, the 
resulting spread of information, and the 
opportunity to contribute comments dur¬ 
ing development outweigh the risk of 
mistaking a draft for an approved stan¬ 
dard? Or, do you think the risk of ac¬ 
cidentally working to an unapproved 
draft (proposed standard), with the pos¬ 
sibility of wasted resources and efforts, 
outweighs the arguments to publish draft 
standards? 

SDIC conference to explore storage/
interface devices, systems architectures, 
networks 


Editorial Board 
welcomes new 
members 


Computer systems designers, in¬ 
tegrators, and specifiers, value-added 
resellers, value-added OEMs, and high- 
volume end users can look forward to 
the first meeting next February of the 
Systems Design and Integration Con¬ 
ference—an event in the planning stage 
for two years. Specifically designed to 
focus on practical solutions to the needs 
of these groups, SDIC will offer three 
days of technical sessions, tutorials, and 
exhibits of computer-related products. 

Each day of the conference highlights 
one of three technologies: storage and 
interface devices, systems architectures, 
and networks. Phil Devin, senior analyst 
for Dataquest’s Computer Storage In¬ 
dustry Service, will open the first day’s 
session, speaking on that industry, its 
products, and its market trends. Planned 
technical sessions include software design 
and integration for 32-bit microproces¬ 
sors, architectures for computer graphics 
(led by IEEE Micro editorial board 
member Richard Mateosian), and stor¬ 
age trends (led by IEEE Micro editorial 
board member Kenneth Majithia). 

Walter J. Utz, Jr., Hewlett-Packard 
software engineering training manager, 
will open the next day’s meeting with 
comments on the impact of RISC on 
computer design. Session topics include 
the impact of GaAs on systems design, 
32-plus bus trends and choices, and sys¬ 
tems design with 32-bit microprocessors. 

SDIC’s last day will begin with re¬ 
marks from J. Edward Snyder, general 
manager, TRW Information Networks 
Division. Solving problems with LANs, 
multivendor systems integration, MAP: 
the key to an integrated factory, and 
fiber optics systems are some of the ses¬ 
sion topics. 

Tutorials 

SDIC tutorials are designed to help 
participants learn about the design and
implementation of databases, motion
control systems, AI/expert systems, and
systems design methodologies and 
CAE/CAD tools, among other subjects. 
Tutorial instructors include consultant 
Herb Edelstein of Digital Consulting 
Associates; Stanford University associate 
professor Gio Wiederhold; researcher, 
author, engineer, and consultant Jacob 
Tal; and the director of University of 
Santa Clara’s Center for Information 
Storage Technology, Al Hoagland.

Registration information 

The Systems Design and Integration 
Conference is scheduled for February 
10-12, 1987, at the Santa Clara Conven¬ 
tion Center in San Jose, California. 
Wescon, the Los Angeles and San Francisco
Bay Area councils of the IEEE, and
the southern and northern California 
chapters of the ERA cosponsor the 
event. 

General registration, which includes 
the opening session and exhibit areas on 
all three days, is $5. Preregistration fees 
for opening session, exhibit area, and 
each day’s technical sessions are $80 per 
day. The seven attendance-limited tutori¬ 
als cost $150 to $180. 

More information on SDIC can be ob¬ 
tained from the conference producers, 
Electronic Conventions Management, at 
(213) 772-2965. 

Editor-in-Chief James J. Farrell III
announced the acceptance of two new 
members to the IEEE Micro editorial 
board, Marlin H. Mickle and Yoichi 
Yano. Mickle assumes the duties of New 
Products editor beginning with this issue. 

Marlin H. Mickle is professor of elec¬ 
trical engineering at the University of 
Pittsburgh, where he has also held the 
positions of graduate program coor¬ 
dinator and director of the Computer 
Engineering Program. He is active in the 
areas of digital computer systems and 
high-technology applications. 


Marlin H. Mickle 




Mickle received his BS, MS, and PhD 
degrees in electrical engineering from the 
University of Pittsburgh in 1961, 1963, 
and 1967. 

Yoichi Yano has been a microproces¬ 
sor architecture designer at Microcom¬ 
puter Products Division, NEC Corpora¬ 
tion, since April 1980. During the past 
four years he participated in the architec¬ 
ture and system designs of the V60/V70 
32-bit VLSI microprocessors. His current 
interests include VLSI processor archi¬ 
tecture and highly parallel computing 
structures. 

Letters 

continued from page 5 

because of the adoption of power-line 
multiplexing as the predominant method 
for home device control, which has in¬ 
adequate bandwidth for computer ap¬ 
plications in general. Later the study 
group effort moved more to a higher 
performance backplane bus. That was 
the precursor to Andy’s writing the PAR 
for the 896 project, which he then 
chaired. Andy preferred to use the more 
modest nomenclature, “Advanced Back¬ 
plane Bus,” rather than “Future Bus,” 
which the rest of us had used for a long 
time previously. He is correct in his 
assertion that the name on the PAR was 
not Future Bus. 

The contribution of Matthew Taub to 
the arbitration scheme used in Fastbus 
(IEEE 960), S-100 (IEEE 696), and 
Futurebus (IEEE P896) was known to us 
in the 696 working group very early after 
its suggestion by David Gustavson. Evi¬ 
dently Andy didn’t happen to attend the 
working group meeting at which that in¬ 
formation was stated. 

The scene depicting the 896 meeting in 
Boulder was based on information Andy 
provided to me over the telephone. I’m 
glad that he points out that Rollie Linser 
was the inspiration for a serial bus that 
can substitute for the parallel bus for 
reliability. Apparently this was not in my 
notes, and so I tagged the balloon to 
Andy as chair expressing a general char¬ 
acteristic. That perhaps can be con¬ 
sidered artistic verisimilitude—and 
possibly justifies the question mark in 
the title after “Historical.” Many of the 
details contained in the presentation were 
my personal recollections going back as 
far as a decade and which had no written 
material anywhere for support. J. D. 
Nicoud spent a year’s sabbatical in Palo 
Alto in the early eighties and attended 
MSC meetings while he was here. Andy 
is correct in saying that he came after the 
meeting at which Andy resigned. 

Andy states that he resigned 
because of the MSC’s 
“general failure to prevent the holders of 
minority viewpoints from indefinitely 
delaying progress.” As one of the 
minority in the 896 working group I’d 
like to state that we had no intention of 
indefinitely delaying progress. Rather we 
felt that the parallel bus as it was then 
proposed was not a significant advance 
on the state of the art and had prob¬ 
lems with driving the bus lines, problems 
which were not resolved. I personally 
was unhappy that Andy saw fit to resign 
rather than to resolve the differences 
within the working group. 

Now let’s see where Andy and I do 
agree. It is now 1986, eight years after 
the Futurebus effort started and no 896 
draft has been approved by the MSC! I 
credit Paul Borrill with working extreme¬ 
ly hard as chair, as had Andy, but Paul 
estimated that at most one year would be 
needed to finish the draft when he 
assumed the chair. It has taken at least 
three years more than that. When the 
realities of industrial competitiveness are 
taken into account, a company simply 
may not be able to wait for the IEEE 
standards development process to take 
its path, and time. The 802 effort to 
develop LAN standards, however, constitutes 
a good counterexample of the 
computer industry working constructively 
within the IEEE framework to 
develop badly needed standards in a 
reasonable time scale. 

What has come out of those three ex¬ 
tra years spent on the 896 effort? 

• The new bus drivers developed by 
R. Balakrishnan of National Semicon¬ 
ductor set a new level of performance in 
backplane buses. 

• The Taub arbitration scheme has 
been enhanced to incorporate the sugges¬ 
tions of Keith Britton improving fairness 
among competitors. 

• A new parallel-bus protocol con¬ 
taining fast two-edged handshakes from 
Fastbus was worked out by John Theus 
of Tektronix. 

• The parallel-bus protocol was ex¬ 
tended to provide services needed by 
caches. 

Are these efforts worth it? Only time 
will really tell. Some of the other 32-bit 
buses now in existence have used features 
first hammered out in the 896 commit¬ 
tee. Under Andy, P896 chose the 
Eurocard format, which VME and MB 2 
and Nubus followed. MB 2 has also 
chosen the Taub arbitration scheme. 
Maybe in the future some of these other 
buses will see fit to retrofit to the higher 
performance possible with the 896 bus 
drivers. 

I did not want to prepare another 
WOW (Write Only Writing) article, 
which only the writer would really read, 
and so chose the cartoon format for the 
presentation. Perhaps Andy is right and 
professional society journals should not 
contain such a format. But the content 
was the best I could recall, and the effort 
needed to draw it was about three times 
that which a conventional written pre¬ 
sentation would have required. 


IEEE standards during the 
Great Bus Wars—another 
view 

(Editor’s note: When I approved the 
August cartoon-feature MicroStandards 
column for publication, it was intended 
to provide the reader with a depiction of 
the events of the past decade in a light 
and, I hoped, humorous manner. Any 
misrepresentation or offense given was 
not intentional and is regretted. Allison 
has responded to my request to submit 
his account of the events he has noted, 
and that account follows.—J.F.) 

The August issue of IEEE Micro 
presented one view of the work done 
within the Computer Society’s Micropro¬ 
cessor Standards Committee. As some¬ 
one who joined that body in November 
1977, just three months after its incep¬ 
tion, and was an active participant for 
over four years, I would like to offer a 
different one. 

The genesis of the Microprocessor 
Standards Committee, originally 
established as a subcommittee of the 
Computer Society’s Standards Commit¬ 
tee, lay in the loss of control by its 
developer (MITS Inc.) over the specifica¬ 
tion of what came to be known as the 
S-100 bus. Although the subcommittee 
immediately began working on a number 
of other proposed standards, microcom¬ 
puter system buses have remained a ma¬ 
jor part of its work—unfortunately, with 
very little practical result. After almost 
10 years of effort, only two microcom¬ 
puter bus standards, S-100 (696) and 
Multibus (796), have been adopted by 
the IEEE. 

Much else of what was presented as 
history in the August issue is in conflict 
with my personal knowledge and the 
public record. For example, the origins 
of what became IEEE-STD-802 (Local 
Networks) and the P896 (originally the 
Advanced Microcomputer System 
Backplane Bus) are misrepresented. The 
following is the text of the first two 
paragraphs of a report on the status of 
the P896 activity in the February 1981 
issue of Micro (“Status Report on the 
P896 Backplane Bus,” Andrew A. Alli¬ 
son, p. 67): 

“A subcommittee on microprocessor 
standards was set up by the IEEE Com¬ 
puter Society in August 1977. By the 
middle of 1978, the committee’s efforts 
toward developing standard specifications 
for the S-100 (P696) and Multibus 
(P796) buses had made clear the need to 
consider future systems bus require¬ 
ments before the emergence of yet 
another generation of de facto but in¬ 
completely specified and incompatible 
buses. 

The working group set up to consider 
this need [This was the Future Bus (not 
Futurebus) subgroup chaired by Cash 
Olsen.] concluded that the buses then 
being specified by the Microprocessor 
Standards Committee could not be ex¬ 
tended to satisfy the requirements an¬ 
ticipated for future microprocessor- 
based systems. Three major categories 
of bus—backplane, local network, and 
residential—were identified. A back¬ 
plane bus subcommittee was set up (by 
the present writer) in June 1979, and 
Project Authorization Request Number 
896 was approved by the IEEE Stan¬ 
dards Board in September of the same 
year. EDSIG—the European Distributed 
Intelligence Study Group—set up a sub¬ 
group in May 1980 to interact with the 
IEEE work. EDSIG is one of the work¬ 
ing groups supported by the Commission 
of European Communities for pro¬ 
moting standardization in the field of 
data processing.” 

This makes it clear where the P896 
and 802 activities originated. The fact 
that Maris Graube quickly took the local 
network effort out from under the 
MSC’s purview is no doubt the reason 
that it is now an IEEE standard. 

The status report was based on the 
working document for the Boulder P896 
workshop and included specification of 
the serial link feature allegedly introduced 
by me at the workshop. The position at¬ 
tributed to me in the August issue is, 
quite simply, false. Credit for the devel¬ 
opment of this feature, part of P896 
from its early days and since incor¬ 
porated into several other buses, and 
probably the most useful outcome of the 
P896 effort, belongs (as I informed 
Stewart when he was researching his 
paper) to Rollie Linser. 

The reference to Versabus in the 
August article is incorrect. Both Versa¬ 
bus and Nubus (then still in the hands of 
M.I.T.) were among the preexisting 
specifications presented to the P896 
working group as candidates for standardization, 
but neither was felt to meet 
the processor-, manufacturer-, and 
technology-independence objectives set for 
P896. 


Similarly, the decision to present a 
proposed draft to the MSC for approval 
to distribute for public comment was the 
result of a vote of the working group. It 
is ironic that the MSC’s January decision 
to deny that request on the basis of a 
minority viewpoint has resulted in Versabus’s 
successor, the VMEbus, becoming 
the de facto standard 32-bit bus. The 
characterization of that vote in the arti¬ 
cle in question is, incidentally, not fac¬ 
tual—among other things, Nicoud was 
not even present! 

The fundamental reasons for the 
failure of the MSC to produce useful 
standards, in my opinion, were (and re¬ 
main) lack of understanding of the dif¬ 
ference between controlled and uncon¬ 
trolled specifications, and of the needs of 
the user communities, and the insistence 
by certain members of the MSC that the 
proposed standards incorporate their 
opinions. Perhaps the most ludicrous ex¬ 
ample of the latter was the holding up of 
the Multibus draft for months until the 
working group chairman acceded to 
Stewart’s nomenclature demands. 

As noted above, in 1977 the S-100 bus 
specification was both popular and out 
of control, with as many implementa¬ 
tions as suppliers and serious incom¬ 
patibilities between them. All of the 
other preexisting buses taken up by the 
MSC since that time have been controlled 
by their proprietors. The failure to 
recognize this fundamental difference 
was the cause of the so-called “bus 
wars,” which were (and are) primarily 
fought over efforts by the MSC to im¬ 
pose, frequently over the objections of 
the working groups actually drafting the 
standards, changes to proprietary, de 
facto standards. The outcome has been 
that the need which led to formation of 
the MSC, namely the interoperability of 
subsystems from different suppliers, has 
been met by the use of de facto rather 
than IEEE standards. 

I submit that the MSC will continue to 
fail in its obligation to provide useful, 
timely standards until it recognizes that: 
(a) microprocessor and computer manu¬ 
facturers have a (perfectly legitimate) 
commercial interest in establishing pro¬ 
prietary buses as de facto standards; (b) 
the MSC has no business ratifying such 
standards, with or without the cosmetic 
changes that are the only kind possible 
for this type of standard; (c) the plethora 
of overlapping bus specifications being 
“standardized” defeats the objective of 
standards development; and (d) the 
marketplace will continue to establish de 
facto standards if adequate (as opposed to 
wonderful) alternatives are not offered in 
timely fashion. 

The foregoing is, unfortunately, prob¬ 
ably irrelevant to microcomputer system 
bus standards development. The IBM 
PC and PC AT buses will clearly remain 
the de facto standards for 8- and 16-bit 
subsystems development for the foresee¬ 
able future. Absent something dramatic 
from IBM very soon, VMEbus’s present 
domination of the 32-bit arena will also 
be secure. In other words, the war is 
over! 

Andrew Allison 

MicroReview 

David L. Hannum/AT&T Information Systems 


Three for Christmas 


Here are three solid products well 
worth the software buyer’s considera¬ 
tion. All three are priced right, and all 
three have received reasonably good 
reviews in the computer press. 

Mirror. This data communications 
program is or at least acts like a clone of 
Crosstalk XVI and therefore could come 
under fire as a result of recent court rul¬ 
ings. (See MicroLaw, p. 76, for com¬ 
ments on pending court action involving 
Mirror.— Ed.) In the meantime, we 
should enjoy having a reasonably priced 
PC software package that not only looks 
and acts like the de facto leader but im¬ 
proves on it. 

Mirror has all the features of Cross¬ 
talk plus several Crosstalk does not pro¬ 
vide. Mirror can run as a background 
program, allowing a user to run an ap¬ 
plication while sending large files to a 
host. It can answer calls—serve as a bul¬ 
letin board, for example—while the user 
is working. It enhances the XMODEM 
protocol. 

As far as I am concerned, the capabil¬ 
ity to run as a background program and 
the XMODEM enhancements are the 
primary reasons one should consider this 
program. 

Mirror runs well, is easy to learn and 
use (no learning required at all if the user 
already knows Crosstalk), and handles 
errors well. Mirror’s developers have 
done a good job with the manual—I 
found it easy to learn how to do a good 
VT100 emulation, for example. 

I give this product a 9+. It’s worth a 
look—at $49.95, it provides higher per¬ 
formance for price than anything else on 
the market. 

Mirror is available from SoftKlone, 
1210 East Park Ave., Tallahassee, FL 
32301; (904) 878-8564. 

RAM-Resident Printmerge. This is 
a very specialized piece of software, 
strictly for HP LaserJet users. It allows 
the sophisticated user to take advantage 
of some of the capabilities built into the 
printer but not available through normal 
applications. It supports line drawing 
and boxing of text and allows graphics, 
charts, and tables to be mixed with 
text. And since it is a RAM-resident program, 
it sits in the background until it is 
needed. 

RAM-Resident Printmerge is easy to 
load, learn, and use, and its manual is 
adequate for those who already under¬ 
stand both the LaserJet and their PC. It 
is a “must have” utility for anyone running 
a PC/LaserJet combo, and at $124 
it is not overpriced, given its capabilities. 
A solid 8, but not for the beginner. 

RAM-Resident Printmerge is available 
from Polaris Software, PO Box 28789, 
San Diego, CA 92128; (619) 489-8243. 

Referee. Programs that hide in RAM 
and emerge when needed are quite use¬ 
ful, but they should make us consider 
what they are doing to the system. Un¬ 
like programs in ROM or on disk, they 
need constant attention or they may ser¬ 
iously degrade the service a PC provides. 

RAM-resident programs take comput¬ 
er resources, particularly processor time. 
This means there may be a fight (collision) 
between the application a user is 
running and a RAM-resident program, 
or even between two or more RAM-resident 
programs. This possibility calls 
for the services of a “referee,” a pro¬ 
gram that can tell other programs when 
to play ball (activate) and when to leave 
the game (deactivate). One such pack¬ 
age, appropriately named Referee, does 
this efficiently for some but not all 
RAM-resident packages. 

Referee is not a beginner’s tool. While 
it works well with Prokey and Sidekick, 
it does not function well with certain 
other programs and may even cause 
what its user wants to prevent, data 
losses and system crashes. One must be 
aware of its quirks. Used knowledgeably, 
Referee is a good package. Its manual is 
more than adequate for the type of user 
who should be running the program. I 
give Referee a solid 7 with the above 
reservations. 

Referee is available for $69.95 from 
Persoft Inc., 465 Science Drive, Madi¬ 
son, WI 53711-9380; (608) 273-6000. 

Next issue. Look for a review of two 
new graphics packages for the IBM PC 
and compatibles—Concorde and Picture 
Perfect. With the next issue, my term as 
MicroReview editor expires. I have en¬ 
joyed evaluating micro products, ser¬ 
vices, and books with you and receiving 
your cards and letters. I welcome my 
successors, editorial board members 
Richard Mateosian and Kenneth Majithia. 
I am sure they will value your suggestions 
and comments, as I did. 


New Products 

Editor: Marlin H. Mickle/University of Pittsburgh 

Send announcements of new microcomputer/microprocessor products, and products for review, to Managing Editor, IEEE 
MICRO, 10662 Los Vaqueros Circle, Los Alamitos, CA 90720-2578. 


In-circuit emulator connects 
to PCs 

An in-circuit emulator from Signum 
Systems provides real-time, transparent 
emulation for the 8031, 8051, and 8751 
microcontrollers. Model E232-51, when 
connected to an IBM PC XT or AT 
through an RS-232 interface, offers debugging 
facilities and a user interface with 
windows, menus, and mouse support. 

The real-time trace buffer of the 
emulator provides information on the 
address and data buses, status lines, 
ports 0 through 3, and 11 external user 
signals. It is capable of stopping the 
recording process after a specified number 
of instructions or cycles so the executing 
program can be recorded without 
stopping the microcontroller. 

Model E232-51 with 64K bytes of 
overlay program memory is priced at 
$3195. Optional 8051 relocatable cross-assembler 
and mouse are available for 
$199.50 and $99. 

Signum Systems, 1820 14th Street, 
Suite 203, Santa Monica, CA 90404; 
(213) 450-6096. 

Reader Service Number 41 

The Signum Systems E232-51 in-circuit emulator features symbolic debugging, in-line 
assembler and disassembler, 128K bytes of address breakpoints, and an 
11-channel user logic state analyzer. It requires a terminal for stand-alone operation 
or an IBM PC or compatible with 128K RAM, one serial communication port, 
and a monochrome or graphics adapter card with monitor. 


VRTX, Unix optional on 32-bit board 

Microbar Systems is shipping its 
MT68020 single-board computer designed 
for use in multitasking applications. The 
MT68020 features a 68020 32-bit processor, 
the Multibus II open-system architecture, 
and options in the operating system, 
DMA, memory management unit, 
and floating-point coprocessor. 

The board is available in 12.5- and 
16.67-MHz versions. On-board, 4M-byte, 
dual-ported dynamic RAM can be 
increased to 16M bytes with one wait 
state with an expansion board. Memory 
is accessed either by the on-board MPU 
or by another Multibus II requester over 
the Parallel System Bus. The message-passing 
coprocessor supports the 
Multibus II PSB interface, and a dual-port 
circuit arbitrates access to RAM. 

MT68020 implements two optional 
operating systems: the AT&T Unix 
System V, Release 2.0, or the Hunter & 
Ready VRTX real-time operating system. 
PROM-resident software supports 
the initial download of a VRTX application 
from the Unix host to a VRTX processor 
and allows it to begin operation. 

In addition to a standard 8-bit PROM 
used to initialize the MT68020, perform 
diagnostics, and set the configuration 
registers, there are four 28-pin sockets 
for users’ implementations of firmware 
in PROM. The sockets accommodate 
sizes from the 2732 (16K bytes) to the 
27512 (256K bytes). 

OEM-quantity prices are under $2000; 
engineering samples cost $3490. 

Microbar Systems, Inc., 785 Lucerne 
Drive, Sunnyvale, CA 94086; (408) 
720-9300. 

Reader Service Number 40 
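
The PROM capacities quoted for the Microbar MT68020 (16K bytes for the 2732, 256K bytes for the 27512) appear to be totals across the board's four sockets rather than per-device sizes. The following Python sketch checks that reading, assuming the usual 27-series EPROM numbering in which the digits after "27" give the device capacity in kilobits. The function names are illustrative only and are not part of any Microbar software.

```python
SOCKETS = 4  # the MT68020 item describes four 28-pin PROM sockets


def eprom_bytes(part: str) -> int:
    """Capacity in bytes of a 27-series EPROM such as '2732' or '27512'.

    Assumes the digits after '27' are the capacity in kilobits.
    """
    if not part.startswith("27") or not part[2:].isdigit():
        raise ValueError(f"not a 27-series part number: {part!r}")
    kilobits = int(part[2:])
    return kilobits * 1024 // 8


def total_firmware_bytes(part: str, sockets: int = SOCKETS) -> int:
    """Total capacity when every socket holds the same device."""
    return sockets * eprom_bytes(part)


if __name__ == "__main__":
    for part in ("2732", "27512"):
        per_device = eprom_bytes(part) // 1024
        total = total_firmware_bytes(part) // 1024
        print(f"{part}: {per_device}K bytes each, {total}K bytes in four sockets")
```

Under that assumption a single 2732 holds 4K bytes and a single 27512 holds 64K bytes, so four of either device give the 16K-byte and 256K-byte figures in the announcement.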
RAM modules support 
VMEbus and VMXbus 

The DSSEDPDX from Data-Sud 
Systems is a dual-ported RAM module 
that supports both the VMEbus and the 
VMXbus, revisions A and B. A -1 ver¬ 
sion is supplied with 512K bytes of 
64K x 4 SIP dynamic RAM, and a -2 
version comes with a capacity of 1M 
byte. 

The DSSEDPDX is an expanded bus, 
double VME board that occupies one 
slot. Its standard front panel (for double 
Eurocard cages) incorporates status 
LEDs for VMEbus/VMXbus access, 
write-protect switches for both buses, 
and an AMP connector for broadcast 
mode. P1 and P2 are both 96-pin 
DIN41612 connectors. 

The DSSEDPDX-1 is priced at $1495; 
the DSSEDPDX-2 costs $1995. Delivery 
for the DSSEDPDX boards is from 
stock to four weeks. 

Data-Sud Systems/U.S., Inc., 5025 
South Ash Avenue, Bldg. B, Suite 5, 
Tempe, AZ 85282; (602) 345-0945. 

The Data-Sud Systems DSSEDPDX module is a VMEbus A32 slave capable of 
driving and monitoring 32 data lines for 32-bit data transfers. The module’s 
VMXbus base memory address is selected by jumper; it decodes 24 address lines 
and 32 data lines on the VMXbus. 


MC68030 production promised for fall 1987 


Motorola’s Microprocessor Products 
Group has announced the MC68030, a 
second-generation 32-bit microprocessor 
unit. According to the company, the 
enhanced 16.67-MHz MPU offers twice 
the performance of its MC68020 and 
maintains upward software code com¬ 
patibility with the M68000 family MPUs. 

Performance improvements include in¬ 
creased internal parallelism, dual on-chip 
caches with a burst fillable mode, dual 
internal data and address buses, im¬ 
proved bus interface, and an on-chip 
paged memory management unit. The 
Harvard-style architecture provides the 
processor with an internal bus bandwidth 
of more than 80M bytes/s. The on-chip 
memory management unit reduces the 
minimum physical bus cycle time to two 
clocks, one half the time required by the 
MC68020 and MC68851. 

Motorola’s high-density MC68030 
contains about 300,000 transistors and 
measures about 378 mils on a side. The 
chip is enclosed in a 128-lead PGA package. 
Sampling is planned for July, production 
for October, and a VMEbus 
microcomputer version for fourth 
quarter 1987. Pricing was not released. 


Motorola Inc., Microprocessor Products 
Group, PO Box 3600, Austin, TX 
78764; (512) 440-2839. 

Block diagram of Motorola’s MC68030 MPU. 


Second-generation 32-bit 
FPC from Motorola 

Motorola Microprocessor Products 
Group has announced a second-generation, 
32-bit floating-point coprocessor, 
which is expected to offer two to four 
times the performance of the MC68881. 
The MC68882 enhanced FPC conforms 
to IEEE 754, the standard for binary 
floating-point arithmetic. It offers add, 
subtract, multiply, divide, and 
transcendental and non-transcendental 
functions. 

The HCMOS VLSI device is designed 
to operate primarily as a coprocessor 
with the MC68020 and MC68030 32-bit 
MPUs through a transparent MC68000 
coprocessor interface. In addition, the 
FPC can be used with M68000-family 
MPU devices and as a peripheral to 
non-M68000 processors. 

The 16.67-MHz MC68882 FPC is enclosed 
in a 68-lead PGA package; it is 
expected to be available for sampling in 
April 1987 with production planned for 
August. 

Motorola, Inc., Microprocessor Products 
Group, PO Box 3600, Austin, TX 
78764; (512) 440-2839. 


Intel boards combine 
Multibus, 80386 features 

Four single-board computers from Intel 
Corporation use a 16-MHz, 80386 
32-bit microprocessor and a dual-bus 
structure to provide high-end processing 
power for intricate applications. The 
iSBC 386/21, /22, /24, and /28 computers 
are supported by iRMX 286, 
Xenix, Unix System V, and any proprietary 
operating system written for the 
8086 or 80286 CPU. 

The boards provide up to 8M bytes of 
32-bit memory, which can be expanded 
to 16M bytes with add-on surface-mount 
modules. The increased memory provides 
users with direct CPU access to 
memory through a 64K-byte zero-wait-state 
cache memory without having to go 
out over the system bus. 

List prices are $4800 for the 386/21, 
$5970 for the /22, $8310 for the /24, and 
$12,990 for the /28. 

Intel Corporation, 3065 Bowers 
Avenue, PO Box 58065, Santa Clara, 
CA 95052-8065; (503) 640-7399. 


Three buses speed CMOS 32-bit multiplier 

Advanced Micro Devices’ Am29C323 
is a 125-ns CMOS 32 x 32-bit parallel 
multiplier. The first member in a 
planned CMOS family of 32-bit microprogrammable 
building blocks, the 
Am29C323 uses less than one watt of 
power while operating at 8 MHz. 

The device’s three buses comprise two 
32-bit input ports and one 32-bit output port. 
It provides individual register feed¬ 
through controls, byte-parity checking 
on both input ports, and parity genera¬ 
tion on the output port. Dual-precision 
registers on each data input port support 
multiprecision multiplication. A 64-bit 
product and a 3-bit overflow product 
permit the accumulation of values larger 
than the normal accumulator width. 
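
As a rough illustration of why extra overflow bits help, the Python sketch below models an unsigned 32 x 32-bit multiply feeding a 67-bit (64 + 3) accumulator. The unsigned arithmetic, the accumulator width, and all names here are assumptions made for the example, not details taken from AMD's data sheet.

```python
MASK32 = (1 << 32) - 1   # largest unsigned 32-bit operand
PRODUCT_BITS = 64        # width of one 32 x 32-bit product
OVERFLOW_BITS = 3        # extra bits assumed above the product
ACC_BITS = PRODUCT_BITS + OVERFLOW_BITS


def multiply32(a: int, b: int) -> int:
    """Unsigned 32 x 32-bit multiply; the result always fits in 64 bits."""
    return (a & MASK32) * (b & MASK32)


def accumulate(pairs):
    """Sum products into the modeled accumulator, raising if it overflows."""
    acc = 0
    for a, b in pairs:
        acc += multiply32(a, b)
        if acc >> ACC_BITS:
            raise OverflowError("sum no longer fits in the accumulator")
    return acc


if __name__ == "__main__":
    worst_case = [(MASK32, MASK32)] * 8   # eight largest-possible products
    total = accumulate(worst_case)
    print("bits needed for the sum:", total.bit_length())
```

With these assumed widths, eight worst-case products can be summed before the accumulator overflows, whereas a plain 64-bit register could overflow on the second.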

During 1987 the company expects to 
introduce additional family products 
such as a 32-bit floating-point processor, 
16-bit microprogram sequencer, 32-bit 
extended-function ALU, and 64 x 18 
dual-access register file. The 125-ns 
Am29C323 is in production now; 100-ns 
and 80-ns versions are planned. 
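
The quoted 125-ns multiply time and 8-MHz operating rate are two views of the same figure, since a rate in MHz is 1000 divided by the cycle time in nanoseconds. The short Python sketch below applies that relation to the three speed grades mentioned. Treating one multiply as exactly one clock cycle is an assumption for illustration, and the function name is hypothetical.

```python
def clock_mhz(cycle_ns: float) -> float:
    """Convert a cycle time in nanoseconds to a rate in megahertz."""
    if cycle_ns <= 0:
        raise ValueError("cycle time must be positive")
    return 1000.0 / cycle_ns


if __name__ == "__main__":
    for cycle_ns in (125, 100, 80):   # current part and the two planned grades
        print(f"{cycle_ns} ns per multiply -> {clock_mhz(cycle_ns)} MHz")
```

On that reading, the planned 100-ns and 80-ns versions would correspond to 10 MHz and 12.5 MHz.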


The 168-pin PGA-packaged multiplier 
costs $245 in quantities of 100. 

Advanced Micro Devices, 901 Thomp¬ 
son Place, PO Box 3453, Sunnyvale, CA 
94088; (408) 982-7448. 

Coming in IEEE Micro 

February: More on operating systems, multiprocessing, and digital 
signal processors 

February’s articles supplement our August, October, and December special 
issues. Included are an examination of performance models for Unix-based 
network file systems, a description of the Heidelberg Polyp system—a fault- 
tolerant multi-microprocessor, and an analysis of alternatives for implementing 
the discrete Fourier transform in signal processing applications. 

1987 EDITORIAL CALENDAR 


IEEE Micro, a bimonthly publication of the Computer Society of the IEEE, focuses on helping the designers and users of 
microprocessor and microcomputer systems explore, produce, evaluate, and apply the latest technologies so that business and research 
objectives can be achieved. 

Feature articles in IEEE Micro are original works relating to the design, performance, or application of microprocessors and 
microcomputers. Tutorial material, industry views, and discussions of standards are often selected for publication. All manuscripts are 
subject to a peer-review process consistent with most professional-level technical publications. This review may take up to four months. 

AD CLOSING DATE: 1st of month preceding issue (Jan. 1st for February issue) 


February 

DIGITAL SIGNAL PROCESSING, 
OPERATING SYSTEMS, 
MULTIPROCESSING 

Additional articles supplementing 
the August, October, and December 
1986 issues. Titles include the 
80386 plus Unix, performance 
analysis of Unix-based network file 
systems, DFT implementations, and 
the Heidelberg Polyprocessor 
system. 


April 

JAPANESE SPECIAL ISSUE: TRON 
32-BIT MICROPROCESSORS 

IEEE Micro editorial board 
member and TRON architect/ 
systems designer Ken Sakamura 
from the University of Tokyo offers a 
fine collection of articles about 
Japan’s newest offering, The Real-time 
Operating System Nucleus. 


June 

NEW DEVELOPMENTS IN 
MICROPROCESSORS 

Explore the latest in design 
technologies and applications with 
this issue. Coverage of the Fairchild 
Clipper, the Intel 80387 coprocessor, 
and other innovative chips is 
planned. 

Deadline for articles: January 1, 
1987 


August 

SYMBOLIC PROCESSORS AND 
SYSTEMS 

Articles will discuss the architecture, 
performance, and application 
features of specialized microprocessors 
that incorporate AI languages in 
hardware. 

Deadline for articles: March 1, 1987 


October 

EUROPEAN SPECIAL ISSUE 

Catch up with the industry’s 
newest technologies from Europe. 
Guest editor is Karl E. Grosspietsch, 
scientist at the German national 
research institute for mathematics 
and data processing, the Gesellschaft 
fuer Mathematik und Datenverarbeitung, 
in St. Augustin, West 
Germany. 

Deadline for articles: April 1, 1987 


December 

THE NEW TECHNOLOGIES 

Read the latest information concerning 
subjects such as GaAs and 
one-micrometer technologies and 
high degrees of silicon integration. 

Deadline for articles: June 1, 1987 


Articles may change. Please contact the editors to confirm. 


HOW TO SUBMIT AN ARTICLE TO IEEE MICRO 

Prospective contributors should submit their manuscripts directly to: 

James J. Farrell III 
Editor-in-Chief, IEEE Micro 
VLSI Technology Incorporated 
10220 South 51st Street 
Phoenix, AZ 85044 
(602) 893-8574 

Successful contributions will be original works with sufficient introductory material and at least 20 percent of the total length devoted to 
tutorial material. The tutorial section will describe the principles or techniques of existing approaches and evaluate their advantages and 
disadvantages. Furthermore, the contributions will describe the practical or potential applications of the material presented. To improve 
readability, the discussion will be augmented with examples, tables, diagrams, charts, and photographs. 











































